RATES OF CONVERGENCE IN ACTIVE LEARNING

By Steve Hanneke

Carnegie Mellon University 

We study the rates of convergence in generalization error achiev- 
able by active learning under various types of label noise. Addition- 
ally, we study the general problem of model selection for active learn- 
ing with a nested hierarchy of hypothesis classes and propose an al- 
gorithm whose error rate provably converges to the best achievable 
error among classifiers in the hierarchy at a rate adaptive to both 
the complexity of the optimal classifier and the noise conditions. In 
particular, we state sufficient conditions for these rates to be dramat- 
ically faster than those achievable by passive learning. 

1. Introduction. Active learning refers to a family of powerful supervised 
learning protocols capable of producing more accurate classifiers while using 
a smaller number of labeled data points than traditional (passive) learning 
methods. Here we study a variant known as pool-based active learning, in 
which a learning algorithm is given access to a large pool of unlabeled data 
(i.e., only the covariates are visible), and is allowed to sequentially request 
the label (response variable) of any particular data points from that pool. 
The objective is to learn a function that accurately predicts the labels of 
new points, while minimizing the number of label requests. Thus, this is a 
type of sequential design scenario for a function estimation problem. This 
contrasts with passive learning, where the labeled data are sampled at ran- 
dom. In comparison, by more carefully selecting which points should be 
labeled, active learning can often significantly decrease the total amount of 
effort required for data annotation. This can be particularly interesting for 
tasks where unlabeled data are available in abundance, but label information 
comes only through significant effort or cost. 

Recently, there has been a series of exciting advances on the topic of
active learning with arbitrary classification noise (the so-called agnostic
PAC model [22]), resulting in several new algorithms capable of achieving
improved convergence rates compared to passive learning under certain
conditions. The first, proposed by Balcan, Beygelzimer and Langford [6], was
the $A^2$ (agnostic active) algorithm, which provably never has significantly
worse rates of convergence than passive learning by empirical risk
minimization. This algorithm was later analyzed in detail in [19], where it
was found that a complexity measure called the disagreement coefficient
characterizes the worst-case convergence rates achieved by $A^2$ for any given
hypothesis class, data distribution and best achievable error rate in the
class. The next major advance was by Dasgupta, Hsu and Monteleoni [14], who
proposed a new algorithm, and proved that it improves the dependence of the
convergence rates on the disagreement coefficient compared to $A^2$. Both
algorithms are
defined below in Section 3. While all of these advances are encouraging, they 
are limited in two ways. First, the convergence rates that have been proven 
for these algorithms typically only improve the dependence on the magni- 
tude of the noise (more precisely, the noise rate of the hypothesis class), 
compared to passive learning. Thus, in an asymptotic sense, for nonzero 
noise rates these results represent at best a constant factor improvement 
over passive learning. Second, these results are limited to learning with a 
fixed hypothesis class of limited expressiveness, so that convergence to the 
Bayes error rate is not always a possibility. 

On the first of these limitations, recent work by Castro and Nowak [12] 
on learning threshold classifiers discovered that if certain parameters of the 
noise distribution are known (namely, parameters related to Tsybakov's mar- 
gin conditions), then we can achieve strict improvements in the asymptotic 
convergence rate via a specific active learning algorithm designed to take 
advantage of that knowledge for thresholds. Subsequently, Balcan, Broder 
and Zhang [7] proved a similar result for linear separators in higher di- 
mensions, and Castro and Nowak [12] showed related improvements for the 
space of boundary fragment classes (under a somewhat stronger assumption 
than Tsybakov's). However, these works left open the question of whether 
such improvements could be achieved by an algorithm that does not ex- 
plicitly depend on the noise conditions (i.e., in the agnostic setting), and 
whether this type of improvement is achievable for more general families of 
hypothesis classes, under the usual complexity restrictions (e.g., VC class, 
entropy conditions, etc.). In a personal communication, John Langford and
Rui Castro claimed that $A^2$ achieves these improvements for the special case
of threshold classifiers (a special case of this also appeared in [9]).
However, there remained an open question of whether such rate improvements
could be generalized to hold for arbitrary hypothesis classes. In Section 4,
we provide this generalization. We analyze the rates achieved by $A^2$ under
Tsybakov's noise conditions [26, 28]; in particular, we find that these rates
are strictly superior to the known rates for passive learning, when the
disagreement coefficient is finite. We also study a novel modification of the
algorithm of Dasgupta, Hsu and Monteleoni [14], proving that it improves upon
the rates of $A^2$ in its dependence on the disagreement coefficient.

Additionally, in Section 5, we address the second limitation by proposing 
a general model selection procedure for active learning with an arbitrary 
structure of nested hypothesis classes. If the classes have restricted expres- 
siveness (e.g., VC classes), the error rate for this algorithm converges to the 
best achievable error by any classifier in the structure, at a rate that adapts 
to the noise conditions and complexity of the optimal classifier. In general, 
if the structure is constructed to include arbitrarily good approximations to 
any classifier, the error converges to the Bayes error rate in the limit. In par- 
ticular, if the Bayes optimal classifier is in some class within the structure, 
the algorithm performs nearly as well as running an agnostic active learning 
algorithm on that single hypothesis class, thus preserving the convergence 
rate improvements achievable for that class. 

2. Definitions and notation. In the active learning setting, there is an
instance space $\mathcal{X}$, a label space $\mathcal{Y} = \{-1,+1\}$ and some
fixed distribution $\mathcal{D}_{XY}$ over $\mathcal{X} \times \mathcal{Y}$,
with marginal $\mathcal{D}_X$ over $\mathcal{X}$. The restriction to binary
classification ($\mathcal{Y} = \{-1,+1\}$) is intended to simplify the
discussion; however, everything below generalizes quite naturally to
multiclass classification (where $\mathcal{Y} = \{1,2,\ldots,k\}$).

There are two sequences of random variables: $X_1, X_2, \ldots$ and
$Y_1, Y_2, \ldots$, where each $(X_i, Y_i)$ pair is independent of the others,
and has joint distribution $\mathcal{D}_{XY}$. However, the learning algorithm
is only permitted direct access to the $X_i$ values (unlabeled data points),
and must request the $Y_i$ values one at a time, sequentially. That is, the
algorithm picks some index $i$ to observe the $Y_i$ value, then after
observing it, picks another index $i'$ to observe the $Y_{i'}$ label value,
etc. We are interested in studying the rate of convergence of the error rate
of the classifier output by the learning algorithm, in terms of the number of
label requests it has made. To simplify the discussion, we will think of the
data sequence as being essentially inexhaustible, and will study
$(1-\delta)$-confidence bounds on the error rate of the classifier produced
by an algorithm permitted to make at most $n$ label requests, for a fixed
value $\delta \in (0,1/2)$. The actual number of (unlabeled) data points the
algorithm uses will be made clear in the proofs (typically close to the
number of points needed by passive learning to achieve the stated error
guarantee).

A hypothesis class $\mathbb{C}$ is any set of measurable classifiers
$h: \mathcal{X} \to \mathcal{Y}$. We will denote by $d$ the VC dimension of
$\mathbb{C}$ (see, e.g., [11, 15, 30-32]). For any measurable
$h: \mathcal{X} \to \mathcal{Y}$ and distribution $\mathcal{D}$ over
$\mathcal{X} \times \mathcal{Y}$, define the error rate of $h$ as
$\mathrm{er}_{\mathcal{D}}(h) = P_{(X,Y) \sim \mathcal{D}}\{h(X) \ne Y\}$; when
$\mathcal{D} = \mathcal{D}_{XY}$, we abbreviate this as $\mathrm{er}(h)$. This
simply represents the risk under the 0-1 loss. We also define the conditional
error rate, given a set $R \subseteq \mathcal{X}$, as
$\mathrm{er}(h|R) = P\{h(X) \ne Y \mid X \in R\}$. Let
$\nu = \inf_{h \in \mathbb{C}} \mathrm{er}(h)$, called the noise rate of
$\mathbb{C}$. For any $x \in \mathcal{X}$, let $\eta(x) = P\{Y = 1 \mid X = x\}$,
let $h^*(x) = 2 \cdot \mathbb{1}[\eta(x) \ge 1/2] - 1$ and let
$\nu^* = \mathrm{er}(h^*)$. We call $h^*$ the Bayes optimal classifier and
$\nu^*$ the Bayes error rate. Additionally, define the diameter of any set of
classifiers $V$ as
$\mathrm{diam}(V) = \sup_{h_1, h_2 \in V} P\{h_1(X) \ne h_2(X)\}$, and for any
$\varepsilon > 0$, define the diameter of the $\varepsilon$-minimal set of $V$
as $\mathrm{diam}(\varepsilon; V) = \mathrm{diam}(\{h \in V : \mathrm{er}(h) - \inf_{h' \in V} \mathrm{er}(h') \le \varepsilon\})$.

For a classifier $h$, and a sequence
$S = \{(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$,
let $\mathrm{er}_S(h) = \frac{1}{|S|} \sum_{(x,y) \in S} \mathbb{1}[h(x) \ne y]$
denote the empirical error rate on $S$ [and define
$\mathrm{er}_{\emptyset}(h) = 0$ by convention]. It will often be convenient to
make use of sets of (index, label) pairs, where the index is used to uniquely
refer to an element of the $\{X_i\}$ sequence (while conveniently also keeping
track of relative ordering information); in such contexts, we will overload
notation as follows. For a classifier $h$, and a finite set of (index, label)
pairs $S = \{(i_1,y_1),(i_2,y_2),\ldots,(i_m,y_m)\} \subset \mathbb{N} \times \mathcal{Y}$,
let $\mathrm{er}_S(h) = \frac{1}{|S|} \sum_{(i,y) \in S} \mathbb{1}[h(X_i) \ne y]$
(and $\mathrm{er}_{\emptyset}(h) = 0$, as before). Thus,
$\mathrm{er}_S(h) = \mathrm{er}_{S'}(h)$, where $S' = \{(X_i, y)\}_{(i,y) \in S}$.
For the indexed true label sequence,
$Z^{(m)} = \{(1,Y_1),(2,Y_2),\ldots,(m,Y_m)\}$, we abbreviate this
$\mathrm{er}_m(h) = \mathrm{er}_{Z^{(m)}}(h)$, the empirical error on the first
$m$ data points.

In addition to the independent interest of understanding the rates achiev- 
able here, another primary interest in this setting is to quantify the achiev- 
able improvements, compared to passive learning. In this context, a passive 
learning algorithm can be formally defined as a function mapping the sequence
$\{(X_1,Y_1),(X_2,Y_2),\ldots,(X_n,Y_n)\}$ to a classifier $\hat{h}_n$; for
instance, perhaps the most widely studied family of passive learning methods
is that of empirical risk minimization (e.g., [23, 27, 30, 31]), which returns
a classifier $\hat{h}_n \in \arg\min_{h \in \mathbb{C}} \mathrm{er}_n(h)$. For
the purpose of this comparison, we review
known results on passive learning in several contexts below. 

2.1. Tsybakov's noise conditions. Here we describe a particular parametriza- 
tion of noise distributions, relative to a hypothesis class, often referred to 
as Tsybakov's noise conditions [26, 28], or margin conditions. These noise 
conditions have recently received substantial attention in the passive learn- 
ing literature, as they describe situations in which the asymptotic minimax 
convergence rate of passive learning is faster than the worst case rate 
(e.g., [23, 26-28]). 

Condition 1. There exist finite constants $\mu > 0$ and $\kappa \ge 1$, s.t.
$\forall \varepsilon > 0$,
$\mathrm{diam}(\varepsilon; \mathbb{C}) \le \mu \varepsilon^{1/\kappa}$.

This condition is satisfied when, for example,

$\exists \mu' > 0, \kappa \ge 1$ s.t. $\exists h \in \mathbb{C} : \forall h' \in \mathbb{C},\ \mathrm{er}(h') - \nu \ge \mu' P\{h(X) \ne h'(X)\}^{\kappa}$

[23]. It is also satisfied when the Bayes optimal classifier is in
$\mathbb{C}$ and

$\exists \mu'' > 0, \alpha \in (0,\infty)$ s.t. $\forall \varepsilon > 0,\ P\{|\eta(X) - 1/2| \le \varepsilon\} \le \mu'' \varepsilon^{\alpha}$,

where $\kappa$ and $\mu$ are functions of $\alpha$ and $\mu''$ [26, 28]; in
particular, $\kappa = (1+\alpha)/\alpha$. As we will see, the case where
$\kappa = 1$ is particularly interesting; for instance, this is the case when
$h^* \in \mathbb{C}$ and $P\{|\eta(X) - 1/2| > c\} = 1$ for some constant
$c \in (0, 1/2)$. Informally, in many cases Condition 1 can be realized in
terms of the relation between magnitude of noise and distance to the optimal
decision boundary; that is, since in practice the amount of noise in a data
point's label is often inversely related to the distance from the decision
boundary, a small $\kappa$ value may often result from having low density near
the decision boundary (i.e., large margin); when this is not the case, the
value of $\kappa$ is often determined by how quickly $\eta(x)$ changes as $x$
approaches the decision boundary. See [7, 12, 23, 26-28] for further
interpretations of this condition.

It is known that when this condition is satisfied for some $\kappa \ge 1$ and
$\mu > 0$, the passive learning method of empirical risk minimization achieves
a convergence rate guarantee, holding with probability $\ge 1 - \delta$, of



where $c$ is a ($\kappa$ and $\mu$-dependent) constant (this follows from
[23, 27]; see Appendix B of the supplementary material [20], especially (17)
and Lemma 5, for the details). Furthermore, for some hypothesis classes, this
is known to be a tight bound (up to the log factor) on the minimax convergence
rate, so that there is no passive learning algorithm for these classes for
which we can guarantee a faster convergence rate, given that the guarantee
depends on $\mathcal{D}_{XY}$ only through $\mu$ and $\kappa$ [12, 28] (see
also Appendix D of the supplementary material [20]).

2.2. Disagreement coefficient. The disagreement coefficient, introduced
in [19], is a measure of the complexity of an active learning problem, which
has proven quite useful for analyzing the convergence rates of certain types
of active learning algorithms: for example, the algorithms of [6, 10, 13, 14].
Informally, it quantifies how much disagreement there is among a set of
classifiers relative to how close to some $h$ they are. The following is a
version of its definition, which we will use extensively below. For any
hypothesis class $\mathbb{C}$ and $V \subseteq \mathbb{C}$, let

$\mathrm{DIS}(V) = \{x \in \mathcal{X} : \exists h_1, h_2 \in V \text{ s.t. } h_1(x) \ne h_2(x)\}.$

For $r \in [0, 1]$ and measurable $h: \mathcal{X} \to \mathcal{Y}$, let

$B(h, r) = \{h' \in \mathbb{C} : P\{h(X) \ne h'(X)\} \le r\}.$

Definition 1. The disagreement coefficient of $h$ with respect to
$\mathbb{C}$ under $\mathcal{D}_X$ is defined as

$\theta_h = \sup_{r > r_0} \frac{P(\mathrm{DIS}(B(h, r)))}{r},$

where $r_0 = 0$ (though see Appendix A.1 for alternative possibilities for
$r_0$).

Definition 2. We further define the disagreement coefficient for the
hypothesis class $\mathbb{C}$ with respect to the target distribution
$\mathcal{D}_{XY}$ as $\theta = \liminf_{k \to \infty} \theta_{h^{[k]}}$, where
$\{h^{[k]}\}$ is any sequence in $\mathbb{C}$ with $\mathrm{er}(h^{[k]})$
monotonically decreasing to $\nu$ [by convention, take every
$h^{[k]} \in \arg\min_{h \in \mathbb{C}} \mathrm{er}(h)$ if the minimum is
achieved].

In Definition 1, it is conceivable that $\mathrm{DIS}(B(h, r))$ may sometimes
not be measurable. In such cases, we can define $P(\mathrm{DIS}(B(h, r)))$ as
the outer measure [29], so that it remains well defined. We continue this
practice below, letting $P$ and $E$ (and indeed any reference to
"probability") refer to the outer expectation and measure in any context for
which this is necessary.

Because of its simple intuitive interpretation, measuring the amount of 
disagreement in a local neighborhood of some classifier h, the disagreement 
coefficient has the wonderful property of being relatively simple to calculate 
for a wide range of learning problems, especially when those problems have 
a natural geometric representation. To illustrate this, we will go through a 
few simple examples from [19]. 

Consider the hypothesis class of thresholds $h_z$ on the interval $[0, 1]$
[for $z \in (0, 1)$], where $h_z(x) = +1$ iff $x \ge z$. Furthermore, suppose
$\mathcal{D}_X$ is uniform on $[0, 1]$. In this case, it is clear that the
disagreement coefficient is 2, since for sufficiently small $r$, the region of
disagreement of $B(h_z, r)$ is $[z - r, z + r)$, which has probability mass
$2r$. In other words, since the disagreement region grows with $r$ in two
disjoint directions, each at rate 1, we have $\theta_{h_z} = 2$.
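
The threshold calculation can be checked numerically. The sketch below is only an illustration under the assumptions of the example (uniform marginal on $[0, 1]$, so two thresholds disagree on a set of mass equal to the distance between them); the function names and the grid of radii are hypothetical, and the supremum over $r$ is approximated on that finite grid.

```python
def dis_mass_thresholds(z: float, r: float) -> float:
    """P(DIS(B(h_z, r))) for thresholds on [0, 1] under the uniform marginal.

    B(h_z, r) contains every threshold h_z' with |z' - z| <= r, and the
    region of disagreement is the set of points lying between two of them.
    """
    lo = max(0.0, z - r)
    hi = min(1.0, z + r)
    return hi - lo


def disagreement_coefficient(z: float, radii) -> float:
    """Approximate sup_{r > 0} P(DIS(B(h_z, r))) / r over a finite grid of r."""
    return max(dis_mass_thresholds(z, r) / r for r in radii)


# Hypothetical grid of radii, chosen only for illustration.
radii = [10.0 ** (-k) for k in range(1, 7)] + [0.25, 0.5, 1.0]
theta = disagreement_coefficient(0.5, radii)
print(theta)  # close to 2, up to floating-point rounding
```

For large $r$ the ball already contains every threshold, so the ratio falls below 2; the supremum is driven by small radii.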

As a second example, consider the disagreement coefficient for intervals on
$[0, 1]$. As before, let $\mathcal{X} = [0, 1]$ and $\mathcal{D}_X$ be
uniform, but this time $\mathbb{C}$ is the set of intervals $h_{[a,b]}$ such
that for $x \in [0, 1]$, $h_{[a,b]}(x) = +1$ iff $x \in [a, b]$ (for
$0 < a < b < 1$). In contrast to thresholds, the disagreement coefficients
$\theta_{h_{[a,b]}}$ for the space of intervals vary widely depending on the
particular $h_{[a,b]}$. Specifically, we have
$\theta_{h_{[a,b]}} = \max\{\frac{1}{b-a}, 4\}$. To see this, note that when
$0 < r < b - a$, every interval in $B(h_{[a,b]}, r)$ has its lower and upper
boundaries within $r$ of $a$ and $b$, respectively; thus,
$P(\mathrm{DIS}(B(h_{[a,b]}, r))) \le 4r$, with equality for sufficiently small
$r$. However, when $r > b - a$, every interval of width $\le r - (b - a)$ is
in $B(h_{[a,b]}, r)$, so that $P(\mathrm{DIS}(B(h_{[a,b]}, r))) = 1$.
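
As a quick numerical instance of the interval argument (the endpoints are hypothetical, chosen only for illustration), take $a = 0.4$ and $b = 0.5$. The two regimes of $r$ described above then give:

```latex
\[
\frac{P(\mathrm{DIS}(B(h_{[a,b]}, r)))}{r}
\;\begin{cases}
\le 4, & 0 < r < b - a = 0.1,\\[2pt]
= 1/r, & r > 0.1,
\end{cases}
\qquad\Longrightarrow\qquad
\theta_{h_{[a,b]}} = \max\Bigl\{\tfrac{1}{b-a},\, 4\Bigr\} = \max\{10,\, 4\} = 10 .
\]
```

The value 10 is approached as $r$ decreases to $b - a$. For a wider interval with $b - a \ge 1/4$, the same reasoning gives $\theta_{h_{[a,b]}} = 4$, so narrow intervals are the ones with large disagreement coefficients.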

As a slightly more involved example, [19] studies the scenario where
$\mathcal{X}$ is the surface of the origin-centered unit sphere in
$\mathbb{R}^d$ for $d > 2$, $\mathbb{C}$ is the space of all linear separators
whose decision surface passes through the origin, and $\mathcal{D}_X$ is the
uniform distribution on $\mathcal{X}$; in this case, it turns out
$\forall h \in \mathbb{C}$ the disagreement coefficient $\theta_h$ satisfies

$\frac{\pi}{4}\sqrt{d} \le \theta_h \le \pi\sqrt{d}.$

The disagreement coefficient has many interesting properties that can 
help to bound its value for a given hypothesis class and distribution. We list 
a few elementary properties below. Their proofs, which are quite short and 
follow directly from the definition, are left as easy exercises. 



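Lemma 1 (Close marginals [19]). Suppose $\exists \lambda \in (0,1]$ s.t. for
any measurable set $A \subseteq \mathcal{X}$,
$\lambda P_{\mathcal{D}_X}(A) \le P_{\mathcal{D}'_X}(A) \le \frac{1}{\lambda} P_{\mathcal{D}_X}(A)$.
Let $h: \mathcal{X} \to \mathcal{Y}$ be a measurable classifier, and suppose
$\theta_h$ and $\theta'_h$ are the disagreement coefficients for $h$ with
respect to $\mathbb{C}$ under $\mathcal{D}_X$ and $\mathcal{D}'_X$,
respectively. Then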



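Lemma 2 (Finite mixtures). Suppose $\exists \alpha \in [0, 1]$ s.t. for any
measurable set $A \subseteq \mathcal{X}$,
$P_{\mathcal{D}_X}(A) = \alpha P_{\mathcal{D}_1}(A) + (1 - \alpha) P_{\mathcal{D}_2}(A)$.
For a measurable $h: \mathcal{X} \to \mathcal{Y}$, let $\theta_h^{(1)}$ be the
disagreement coefficient with respect to $\mathbb{C}$ under $\mathcal{D}_1$,
$\theta_h^{(2)}$ be the disagreement coefficient with respect to $\mathbb{C}$
under $\mathcal{D}_2$, and $\theta_h$ be the disagreement coefficient with
respect to $\mathbb{C}$ under $\mathcal{D}_X$. Then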

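Lemma 3 (Finite unions). Suppose $h \in \mathbb{C}_1 \cap \mathbb{C}_2$ is a
classifier s.t. the disagreement coefficient with respect to $\mathbb{C}_1$
under $\mathcal{D}_X$ is $\theta_h^{(1)}$ and with respect to $\mathbb{C}_2$
under $\mathcal{D}_X$ is $\theta_h^{(2)}$. Then if $\theta_h$ is the
disagreement coefficient with respect to
$\mathbb{C} = \mathbb{C}_1 \cup \mathbb{C}_2$ under $\mathcal{D}_X$, we have
that

$\max\{\theta_h^{(1)}, \theta_h^{(2)}\} \le \theta_h \le \theta_h^{(1)} + \theta_h^{(2)}.$

In fact, even if $h \notin \mathbb{C}_1 \cap \mathbb{C}_2$, we still have
$\theta_h \le \theta_h^{(1)} + \theta_h^{(2)} + 2$.
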
See [8, 10, 14, 16, 19, 33] for further discussions of various uses of the
disagreement coefficient and related notions and extensions in active
learning. In particular, Friedman [16] proves that any hypothesis class and
distribution satisfying certain general regularity conditions will admit
finite constant bounds on $\theta$. Also, Wang [33] bounds the disagreement
coefficient for certain
nonparametric hypothesis classes, characterized by smoothness of their deci- 
sion surfaces. Additionally, Beygelzimer, Dasgupta and Langford [10] present 
an interesting analysis using a natural extension of the disagreement coeffi- 
cient to study active learning with a larger family of loss functions beyond 
0-1 loss. 

The disagreement coefficient has deep connections to several other quan- 
tities, such as doubling dimension [25] and VC dimension [30]. Additionally, 
a related quantity, referred to as the "capacity function," was studied in the 
1980s by Alexander in the passive learning literature, in the context of
ratio-type empirical processes [2-4] and recently was further developed by
Giné and Koltchinskii [17]; interestingly, in this latter work, Giné and
Koltchinskii study a localized version of the capacity function, which in our
present context can essentially be viewed as the function
$\tau(r) = P(\mathrm{DIS}(B(h, r)))/r$, so that
$\theta_h = \sup_{r > r_0} \tau(r)$.

3. General algorithms. We begin the discussion of the algorithms we will 
analyze by noting the underlying inspiration that unifies them. Specifically, 
at this writing, all of the published general-purpose agnostic active learn- 
ing algorithms achieving nontrivial improvements are derivatives of a basic 
technique proposed by Cohn, Atlas and Ladner [13] for the realizable active 
learning problem. Under the assumption that there exists a perfect classifier 
in C, they proposed an algorithm which processes unlabeled data points in 
sequence, and for each one it determines whether there is a classifier in C 
consistent with all previously observed labels that predicts +1 for this new 
point and one that predicts —1 for this new point; if so, the algorithm re- 
quests the label, and otherwise it does not request the label; after n label 
requests, the algorithm returns any classifier consistent with all observed 
labels. In some sense, this algorithm corresponds to the very least we could 
expect of an active learning algorithm, as it never requests the label of a 
point it can derive from known information, but otherwise makes no effort 
to search for informative data points. The idea is appealing, not only for its 
simplicity, but also for its extremely efficient use of unlabeled data; in fact, 
under the stated assumption, the algorithm produces a classifier consistent 
with the labels of all of the unlabeled data it processes, including those it 
does not request the labels of. 

We can equivalently think of this algorithm as maintaining two sets:
$V \subseteq \mathbb{C}$ is the set of candidate hypotheses still under
consideration, and $R = \mathrm{DIS}(V)$ is their region of disagreement. We
can then think of the algorithm as requesting a random labeled point from the
conditional distribution of $\mathcal{D}_{XY}$ given that $X \in R$, and
subsequently removing from $V$ any classifier inconsistent with the observed
label. A formal definition of the algorithm is given as follows.



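Algorithm 0

Input: hypothesis class $\mathbb{C}$, label budget $n$
Output: classifier $\hat{h}_n \in \mathbb{C}$

0. $V_0 \leftarrow \mathbb{C}$, $t \leftarrow 0$
1. For $m = 1, 2, \ldots$
2.    If $X_m \in \mathrm{DIS}(V_t)$,
3.       Request $Y_m$
4.       $t \leftarrow t + 1$
5.       $V_t \leftarrow \{h \in V_{t-1} : h(X_m) = Y_m\}$
6.    If $t = n$ or $\{m' > m : X_{m'} \in \mathrm{DIS}(V_t)\} = \emptyset$, Return any $\hat{h}_n \in V_t$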
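
To make the procedure concrete, the following is a minimal sketch of the disagreement-based scheme above, specialized to threshold classifiers on $[0, 1]$ in the realizable case. Everything specific to the sketch is an assumption made for illustration: the name `cal_thresholds`, the finite pool, the label oracle and the true threshold are hypothetical. For thresholds the version space is an interval of candidate thresholds, so the region of disagreement is an open interval and membership can be tested directly.

```python
import random


def cal_thresholds(pool, oracle, n):
    """Disagreement-based label requests for thresholds h_z(x) = +1 iff x >= z.

    The version space is the set of thresholds z with lo < z <= hi, so its
    region of disagreement is the open interval (lo, hi).
    """
    lo, hi = 0.0, 1.0   # every threshold in (0, 1) starts out as a candidate
    t = 0               # number of labels requested so far
    processed = 0       # number of unlabeled points examined
    for x in pool:
        if t == n:
            break
        processed += 1
        if lo < x < hi:            # x lies in the region of disagreement
            y = oracle(x)          # request the label
            t += 1
            if y == 1:
                hi = x             # only thresholds z <= x remain consistent
            else:
                lo = x             # only thresholds z > x remain consistent
    z_hat = (lo + hi) / 2.0        # any surviving threshold may be returned
    return z_hat, t, processed


# Hypothetical realizable problem: uniform pool, true threshold 0.3, budget 15.
rng = random.Random(0)
pool = [rng.random() for _ in range(2000)]
z_true = 0.3
z_hat, used, processed = cal_thresholds(
    pool, lambda x: 1 if x >= z_true else -1, 15
)
```

In this sketch the returned threshold agrees with the true labels on every pool point examined, including those whose labels were never requested, which is the property noted above for the realizable case.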



The algorithms described below for the problem of active learning with 
label noise each represent noise-robust variants of this basic idea. They work 
to reduce the set of candidate hypotheses, while only requesting the labels of 
points in the region of disagreement of these candidates. The trick is to only 
remove a classifier from the candidate set once we have high statistical con- 
fidence that it is worse than some other candidate classifier so that we never 
remove the best classifier. However, the two algorithms differ somewhat in 
the details of how that confidence is calculated. 

3.1. Algorithm 1. The first noise-robust algorithm we study, originally
proposed by Balcan, Beygelzimer and Langford [6], is typically referred to
as $A^2$, for Agnostic Active. This was historically the first general-purpose
agnostic active learning algorithm shown to achieve improved error guarantees
for certain learning problems in certain ranges of $n$ and $\nu$. Below is a
variant of this algorithm. It is defined in terms of two functions: UB and LB.
These represent upper and lower confidence bounds on the error rate of a
classifier from $\mathbb{C}$ with respect to an arbitrary sampling
distribution, as a function of a labeled sequence sampled according to that
distribution. Some steps in the algorithm require calculating certain
probabilities, such as $P(\mathrm{DIS}(V))$ or $P(R)$; later, we discuss
replacing these with appropriate estimators.
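
Algorithm 1

Input: hypothesis class $\mathbb{C}$, label budget $n$, confidence $\delta$, functions UB and LB
Output: classifier $\hat{h}_n$

0. $V \leftarrow \mathbb{C}$, $R \leftarrow \mathrm{DIS}(\mathbb{C})$, $Q \leftarrow \emptyset$, $m \leftarrow 0$
1. For $t = 1, 2, \ldots, n$
2.    If $P(\mathrm{DIS}(V)) \le \frac{1}{2} P(R)$
3.       $R \leftarrow \mathrm{DIS}(V)$; $Q \leftarrow \emptyset$
4.    If $P(R) \le 2^{-n}$, Return any $\hat{h} \in V$
5.    $m \leftarrow \min\{m' > m : X_{m'} \in R\}$
6.    Request $Y_m$ and let $Q \leftarrow Q \cup \{(m, Y_m)\}$
7.    $V \leftarrow \{h \in V : \mathrm{LB}(h, Q, \delta/n) \le \min_{h' \in V} \mathrm{UB}(h', Q, \delta/n)\}$
8.    $h_t \leftarrow \arg\min_{h \in V} \mathrm{UB}(h, Q, \delta/n)$
9.    $\beta_t \leftarrow (\mathrm{UB}(h_t, Q, \delta/n) - \min_{h \in V} \mathrm{LB}(h, Q, \delta/n)) P(R)$
10. Return $\hat{h}_n = h_{\hat{t}}$, where $\hat{t} = \arg\min_{t \in \{1, 2, \ldots, n\}} \beta_t$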

The intuitive motivation behind the algorithm is the following. It focuses
on reducing the set of candidate hypotheses $V$, while being careful not to
throw away the best classifier
$h^{\dagger} = \arg\min_{h \in \mathbb{C}} \mathrm{er}(h)$ (supposing, for this
informal explanation, that $h^{\dagger}$ exists). Given that this is satisfied
at any given time in the algorithm, it makes sense to focus our samples on the
region $\mathrm{DIS}(V)$, since a classifier $h_1 \in V$ has smaller error rate
than another classifier $h_2 \in V$ if and only if it has smaller conditional
error rate given $\mathrm{DIS}(V)$. For this reason, on each round, we seek to
remove from $V$ any $h$ for which our confidence bounds indicate that
$\mathrm{er}(h | \mathrm{DIS}(V)) > \mathrm{er}(h^{\dagger} | \mathrm{DIS}(V))$.
However, so that we can make use of known results for i.i.d. samples, we
freeze the sampling region $R \supseteq \mathrm{DIS}(V)$ and collect an i.i.d.
sample from the conditional given this region, updating the region only when
doing so allows us to further significantly focus the samples; for this same
reason, we also reset the collection of samples $Q$ every time we update the
region $R$, so that it represents samples from the conditional given $R$.
Finally, we maintain the values $\beta_t$, which represent confidence upper
bounds on
$\mathrm{er}(h_t) - \nu = (\mathrm{er}(h_t | R) - \mathrm{er}(h^{\dagger} | R)) P(R)$,
and we return the $h_t$ minimizing this confidence bound; note that it does
not suffice to return $h_n$ since the final $Q$ set might be small.

As long as the confidence bounds UB and LB satisfy (overloading notation in
the natural way)

$P_{Z \sim \mathcal{D}^m}\{\forall h \in \mathbb{C},\ \mathrm{LB}(h, Z, \delta') \le \mathrm{er}_{\mathcal{D}}(h) \le \mathrm{UB}(h, Z, \delta')\} \ge 1 - \delta'$

for any distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ and
any $\delta' \in (0,1)$, and UB and LB converge to each other as $m$ grows, it
is known that a $1 - \delta$ confidence bound on
$\mathrm{er}(\hat{h}_n) - \nu$ converges to 0 [6]. For instance, Balcan,
Beygelzimer and Langford [6] suggest defining these functions based on classic
results on uniform convergence rates in passive learning [30], such as

(1)    $\mathrm{UB}(h, Q, \delta') = \min\{\mathrm{er}_Q(h) + G(|Q|, \delta'), 1\}$,
       $\mathrm{LB}(h, Q, \delta') = \max\{\mathrm{er}_Q(h) - G(|Q|, \delta'), 0\}$,

where $G(m, \delta') = \frac{1}{m} + \sqrt{\frac{\ln(4/\delta') + d \ln(2em/d)}{m}}$
for $m \ge d$, and by convention $G(m, \delta') = \infty$ for $m < d$. This
choice of UB and LB is motivated by the following lemma, due to Vapnik [31].

Lemma 4. For any distribution $\mathcal{D}$ over
$\mathcal{X} \times \mathcal{Y}$, and any $\delta' \in (0, 1)$ and
$m \in \mathbb{N}$, with probability $\ge 1 - \delta'$ over the draw of
$Z \sim \mathcal{D}^m$, every $h \in \mathbb{C}$ satisfies

(2)    $|\mathrm{er}_Z(h) - \mathrm{er}_{\mathcal{D}}(h)| \le G(m, \delta')$.
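
The confidence bounds in (1) are simple to compute once the empirical error is available. The sketch below is an illustration only: it takes $G(m, \delta') = 1/m + \sqrt{(\ln(4/\delta') + d \ln(2em/d))/m}$, which is how the definition is read here, represents $Q$ as a plain list of (point, label) pairs rather than (index, label) pairs, and uses a hypothetical threshold classifier and a hypothetical labeled sample.

```python
import math


def G(m: int, delta: float, d: int) -> float:
    """Uniform deviation term; infinite by convention when m < d."""
    if m < d:
        return math.inf
    return 1.0 / m + math.sqrt(
        (math.log(4.0 / delta) + d * math.log(2.0 * math.e * m / d)) / m
    )


def empirical_error(h, Q) -> float:
    """Fraction of pairs (x, y) in Q that h misclassifies; 0 for empty Q."""
    if not Q:
        return 0.0
    return sum(1 for x, y in Q if h(x) != y) / len(Q)


def UB(h, Q, delta, d):
    """Upper confidence bound on the error rate of h, capped at 1."""
    return min(empirical_error(h, Q) + G(len(Q), delta, d), 1.0)


def LB(h, Q, delta, d):
    """Lower confidence bound on the error rate of h, floored at 0."""
    return max(empirical_error(h, Q) - G(len(Q), delta, d), 0.0)


# Hypothetical example: threshold at 0.5 (VC dimension 1) and 200 labeled
# points, the first 10 of which carry a flipped label.
h = lambda x: 1 if x >= 0.5 else -1
Q = [(i / 200.0, h(i / 200.0)) for i in range(200)]
Q[:10] = [(x, -y) for x, y in Q[:10]]
upper, lower = UB(h, Q, 0.05, 1), LB(h, Q, 0.05, 1)
```

With only 200 labels the interval is wide (the deviation term is roughly 0.24 here), which illustrates why the algorithm compares bounds across classifiers rather than relying on any single estimate.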

To avoid computational issues, instead of explicitly representing the sets
$V$ and $R$, we may implicitly represent them as a set of constraints imposed
by the condition in step 7 of previous iterations. We may also replace
$P(\mathrm{DIS}(V))$ and $P(R)$ by estimates, since these quantities can be
estimated to arbitrary precision with arbitrarily high confidence using only
unlabeled data. Specifically, the convergence rates proven below can be
preserved up to constant factors by replacing these quantities with confidence
bounds based on a finite number of unlabeled data points; the details of this
are included in Appendix C of the supplementary material [20]. As for the
number of unlabeled data points required by the above algorithm itself, note
that if $P(\mathrm{DIS}(V))$ becomes small, it will use a large number of
unlabeled data points; however, $P(\mathrm{DIS}(V))$ being small also
indicates $\mathrm{er}(\hat{h}_n) - \nu$ is small (and indeed $\beta_t$). In
particular, to get an excess error rate of $\varepsilon$, the algorithm will
generally require a number of unlabeled data points only polynomial in
$1/\varepsilon$; also, the condition in step 4 guarantees the total number of
unlabeled data points used by the algorithm is bounded with high probability.
For comparison, recall that passive learning typically requires a number of
labeled data points polynomial in $1/\varepsilon$.

3.2. Algorithm 2. The second noise-robust algorithm we study was originally
proposed by Dasgupta, Hsu and Monteleoni [14]. It uses a type of constrained
passive learning subroutine, Learn, defined as follows for two sets of labeled
data points, $\mathcal{L}$ and $Q$.

$\mathrm{Learn}_{\mathbb{C}}(\mathcal{L}, Q) = \arg\min_{h \in \mathbb{C} :\ \mathrm{er}_{\mathcal{L}}(h) = 0} \mathrm{er}_Q(h).$

By convention, if no $h \in \mathbb{C}$ has $\mathrm{er}_{\mathcal{L}}(h) = 0$,
$\mathrm{Learn}_{\mathbb{C}}(\mathcal{L}, Q) = \emptyset$. The algorithm is
formally defined below, in terms of a sequence of estimators $\Delta_m$,
defined later.
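
Algorithm 2

Input: hypothesis class $\mathbb{C}$, label budget $n$, confidence $\delta$, functions $\Delta_m$
Output: classifier $\hat{h}_n$, sets of (index, label) pairs $\mathcal{L}$ and $Q$

0. $\mathcal{L} \leftarrow \emptyset$, $Q \leftarrow \emptyset$
1. For $m = 1, 2, \ldots$
2.    If $|Q| = n$ or $m > 2^n$, Return $\hat{h}_n = \mathrm{Learn}_{\mathbb{C}}(\mathcal{L}, Q)$ along with $\mathcal{L}$ and $Q$
3.    For each $y \in \{-1,+1\}$, let $h^{(y)} = \mathrm{Learn}_{\mathbb{C}}(\mathcal{L} \cup \{(m, y)\}, Q)$
4.    If some $y$ has $h^{(-y)} = \emptyset$ or $\mathrm{er}_{\mathcal{L} \cup Q}(h^{(-y)}) - \mathrm{er}_{\mathcal{L} \cup Q}(h^{(y)}) > \Delta_{m-1}(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta)$
5.    Then $\mathcal{L} \leftarrow \mathcal{L} \cup \{(m, y)\}$
6.    Else Request the label $Y_m$ and let $Q \leftarrow Q \cup \{(m, Y_m)\}$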

The algorithm maintains two sets of labeled data points: $\mathcal{L}$ and
$Q$. The set $Q$ represents points of which we have requested the labels. The
set $\mathcal{L}$ represents the remaining points, and the labels of points in
$\mathcal{L}$ are inferred. Specifically, suppose (inductively) that at some
time $m$ we have that every $(i, y) \in \mathcal{L}$ has
$h^{\dagger}(X_i) = y$, where
$h^{\dagger} = \arg\min_{h \in \mathbb{C}} \mathrm{er}(h)$ (supposing the min is
achieved, for this informal motivation). At any point, we can be fairly
confident that $h^{\dagger}$ will have relatively small empirical error rate.
Thus, if all of the classifiers $h$ with $\mathrm{er}_{\mathcal{L}}(h) = 0$ and
$h(X_m) = -y$ have relatively large empirical error rates compared to some $h$
with $\mathrm{er}_{\mathcal{L}}(h) = 0$ and $h(X_m) = y$, we can confidently
infer that $h^{\dagger}(X_m) = y$. Note that this is not the true label $Y_m$,
but a sort of "denoised" version of it. Once we infer this label, since we are
already confident that this is the label, and $h^{\dagger}$ is the classifier
we wish to compete with, we simply add this label as a constraint: that is, we
require every classifier under consideration in the future to have
$h(X_m) = h^{\dagger}(X_m)$. This is how elements of $\mathcal{L}$ are added.
On the other hand, if we cannot confidently infer $h^{\dagger}(X_m)$, because
some classifiers labeling $X_m$ as $-h^{\dagger}(X_m)$ also have relatively
small empirical error rates, then we simply request the label $Y_m$ and add it
to the set $Q$. Note that in order to make this comparison, we needed to be
able to calculate the differences of empirical error rates; however, as long
as we only consider the set of classifiers $h$ that agree on the labels in
$\mathcal{L}$, we will have
$\mathrm{er}_{\mathcal{L} \cup Q}(h_1) - \mathrm{er}_{\mathcal{L} \cup Q}(h_2) = \mathrm{er}_m(h_1) - \mathrm{er}_m(h_2)$,
for any two such classifiers $h_1$ and $h_2$, where
$m = |\mathcal{L} \cup Q|$.

The key to the above argument is carefully choosing a threshold for how
large the difference in empirical error rates needs to be before we can
confidently infer the label. For this purpose, Algorithm 2 is defined in terms
of a function, $\Delta_m(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta)$,
representing a threshold for a type of hypothesis test. This threshold must be
set carefully, since the sequence of labeled data points corresponding to
$\mathcal{L} \cup Q$ is not actually an i.i.d. sample from $\mathcal{D}_{XY}$.
Dasgupta, Hsu and Monteleoni [14] suggest defining this function as

(3)    $\Delta_m(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta) = \beta_m^2 + \beta_m \left( \sqrt{\mathrm{er}_{\mathcal{L} \cup Q}(h^{(y)})} + \sqrt{\mathrm{er}_{\mathcal{L} \cup Q}(h^{(-y)})} \right)$,

where $\beta_m = \sqrt{\frac{4 \ln(8 m (m+1) \mathcal{S}(\mathbb{C}, 2m)^2 / \delta)}{m}}$
and $\mathcal{S}(\mathbb{C}, 2m)$ is the shatter coefficient (e.g., [15, 31]);
this suggestion is based on a confidence bound they derive, and they prove the
correctness of the algorithm with this definition, meaning that the
$1 - \delta$ confidence bound on its error rate converges to $\nu$ as
$n \to \infty$. For now we will focus on the first return value (the
classifier), leaving the others for Section 5, where they will be useful for
chaining multiple executions together.
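
The threshold in (3) can be evaluated directly once the shatter coefficient is available. The sketch below is illustrative and rests on stated assumptions: it reads the definition as $\beta_m = \sqrt{4 \ln(8m(m+1)\mathcal{S}(\mathbb{C},2m)^2/\delta)/m}$, and it substitutes the Sauer's lemma upper bound $\sum_{i \le d} \binom{k}{i}$ for the shatter coefficient, which can only make the test more conservative. The function names and the numerical inputs are hypothetical.

```python
import math


def sauer_bound(d: int, k: int) -> int:
    """Sauer's lemma upper bound on the shatter coefficient S(C, k)."""
    return sum(math.comb(k, i) for i in range(min(d, k) + 1))


def beta(m: int, delta: float, shatter: int) -> float:
    """beta_m, with `shatter` standing in for S(C, 2m)."""
    return math.sqrt(
        4.0 * math.log(8.0 * m * (m + 1) * shatter ** 2 / delta) / m
    )


def delta_threshold(m, err_y, err_neg_y, delta, shatter):
    """Threshold beta_m^2 + beta_m * (sqrt(err_y) + sqrt(err_neg_y))."""
    b = beta(m, delta, shatter)
    return b * b + b * (math.sqrt(err_y) + math.sqrt(err_neg_y))


# Hypothetical inputs: VC dimension 1, empirical errors 0.10 and 0.20.
d, conf = 1, 0.05
small = delta_threshold(100, 0.10, 0.20, conf, sauer_bound(d, 200))
large = delta_threshold(10000, 0.10, 0.20, conf, sauer_bound(d, 20000))
```

With these inputs the threshold at $m = 100$ exceeds 1, so no label could be inferred that early, while at $m = 10000$ it has dropped to roughly 0.12. A label is inferred only when the empirical error gap between $h^{(-y)}$ and $h^{(y)}$ exceeds the threshold.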

4. Convergence rates. In both of the above cases, one can prove guaran- 
tees stating that neither algorithm's convergence rates are ever significantly 
worse than passive learning by empirical risk minimization [6, 14]. However, 
it is even more interesting to discuss situations in which one can prove error 
rate guarantees for these algorithms significantly better than those achiev- 
able by passive learning. In this section, we begin by reviewing known re- 
sults on these potential improvements, stated in terms of the disagreement 
coefficient; we then proceed to discuss new results for Algorithm 1 and a 
novel variant of Algorithm 2, and describe the convergence rates achieved 
by these methods in terms of the disagreement coefficient and Tsybakov's 
noise conditions. 

To simplify the presentation, for the remainder of this paper we will restrict
the discussion to situations with $\theta > 0$ (and therefore $\mathbb{C}$ with $d > 0$
too). Handling the extra case of $\theta = 0$ is a trivial matter, since $\theta = 0$ would
imply that any proper learning algorithm achieves excess error $0$ for all
values of n.

4.1. The disagreement coefficient and active learning: Basic results. Before
going into the results for general distributions $\mathcal{D}_{XY}$ on $\mathcal{X}\times\mathcal{Y}$, it will be
instructive to first look at the special case when the noise rate is zero. Un- 
derstanding how the disagreement coefficient enters into the analysis of this 
simpler case may aid in digestion of the theorems and proofs for the general 
case presented later, where it plays an essentially analogous role. Most of 
the major ingredients of the proofs for the general case can be found in this 
special case, albeit in a much simpler form. Although this result has not 
previously been published, the proof is essentially analogous to (one case 
of) the analysis of Algorithm 1 in [19]. 

Theorem 1. Let $f \in \mathbb{C}$ be such that $\mathrm{er}(f) = 0$ and $\theta_f < \infty$. $\forall n \in \mathbb{N}$ and
$\delta \in (0,1)$, with probability $\ge 1-\delta$ over the draw of the unlabeled data, the
classifier $h_n$ returned by Algorithm 0 after n label requests satisfies

$\mathrm{er}(h_n) \le 2\cdot\exp\left\{-\frac{n}{12\theta_f(d\ln(22\theta_f) + \ln(3n/\delta))}\right\}$.



Proof. As in the algorithm, let $V_t$ denote the set of classifiers in $\mathbb{C}$
consistent with the first t label requests. If $\mathbb{P}(\mathrm{DIS}(V_t)) > 0$ for all values of
t in the algorithm, then with probability 1 the algorithm uses all n label
requests. Technically, each claim below should be followed by the phrase,
"unless $\mathbb{P}(\mathrm{DIS}(V_t)) = 0$ for some $t \le n$, in which case $\mathrm{er}(h_n) = 0$ so the
bound trivially holds." However, to simplify the presentation, we will make
this special case implicit, and will not mention it further.

The high-level outline of this proof is to use $\mathbb{P}(\mathrm{DIS}(V_t))$ as an upper bound
on $\sup_{h\in V_t}\mathrm{er}(h)$, and then show $\mathbb{P}(\mathrm{DIS}(V_t))$ is halved roughly every $\lambda =
O(\theta_f d)$ label requests. Thus, after roughly $O(\theta_f d\log(1/\epsilon))$ label requests,
any $h \in V_t$ should have $\mathrm{er}(h) \le \epsilon$.

Specifically, let $\lambda_n = \lceil 8\theta_f(d\ln(8e\theta_f) + \ln(2n/\delta))\rceil$. If $n \le \lambda_n$, the bound in
the theorem statement trivially holds, since the right-hand side exceeds 1;
otherwise, consider some nonnegative $t \le n - \lambda_n$ and $t' = t + \lambda_n$. Let $X_{m_t}$
denote the point corresponding to the t-th label request, and let $X_{m_{t'}}$ denote
the point corresponding to label request number $t'$. It must be that

$|\{X_{m_t+1}, X_{m_t+2}, \ldots, X_{m_{t'}}\} \cap \mathrm{DIS}(V_t)| \ge \lambda_n$,

which means there is an i.i.d. sample of size $\lambda_n$, with distribution equivalent
to the conditional of X given $\{X \in \mathrm{DIS}(V_t)\}$, contained in $\{X_{m_t+1}, \ldots, X_{m_{t'}}\}$:
namely, the first $\lambda_n$ points in this subsequence that are in $\mathrm{DIS}(V_t)$.

Now recall that, by classic results from the passive learning literature (e.g.,
[5]), this implies that on an event $E_{\delta,t}$ holding with probability $1-\delta/n$,

$\sup_{h\in V_{t'}} \mathrm{er}(h|\mathrm{DIS}(V_t)) \le \frac{d\ln(2e\lambda_n/d) + \ln(2n/\delta)}{\lambda_n}$.

Also note that $\lambda_n$ was defined (with express purpose) so that

$\frac{d\ln(2e\lambda_n/d) + \ln(2n/\delta)}{\lambda_n} \le \frac{1}{2\theta_f}$.

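Recall that, since $\mathrm{er}(f) = 0$, we have $\mathrm{er}(h) = \mathbb{P}(h(X) \ne f(X))$. Since $f \in
V_{t'} \subseteq V_t$, this means for any $h \in V_{t'}$ we have $\{x : h(x) \ne f(x)\} \subseteq \mathrm{DIS}(V_t)$, and
thus

$\sup_{h\in V_{t'}} \mathbb{P}(h(X) \ne f(X)) = \sup_{h\in V_{t'}} \mathbb{P}(h(X) \ne f(X) \mid X \in \mathrm{DIS}(V_t))\,\mathbb{P}(\mathrm{DIS}(V_t))$

$= \sup_{h\in V_{t'}} \mathrm{er}(h|\mathrm{DIS}(V_t))\,\mathbb{P}(\mathrm{DIS}(V_t)) \le \mathbb{P}(\mathrm{DIS}(V_t))/(2\theta_f)$.

So $V_{t'} \subseteq B(f, \mathbb{P}(\mathrm{DIS}(V_t))/(2\theta_f))$, and therefore by monotonicity of $\mathbb{P}(\mathrm{DIS}(\cdot))$
and the definition of $\theta_f$,

$\mathbb{P}(\mathrm{DIS}(V_{t'})) \le \mathbb{P}(\mathrm{DIS}(B(f, \mathbb{P}(\mathrm{DIS}(V_t))/(2\theta_f)))) \le \mathbb{P}(\mathrm{DIS}(V_t))/2$.

By a union bound, $E_{\delta,t}$ holds for every $t \in \{i\lambda_n : i \in \{0, 1, \ldots, \lfloor n/\lambda_n\rfloor - 1\}\}$
with probability $\ge 1-\delta$. On these events, if $n \ge \lambda_n\lceil\log_2(1/\epsilon)\rceil$, then (by
induction)

$\sup_{h\in V_n} \mathrm{er}(h) \le \mathbb{P}(\mathrm{DIS}(V_n)) \le \epsilon$.

Solving for $\epsilon$ in terms of n gives the result (with a slight increase in constants
due to relaxing the ceiling functions). □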
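
The halving argument in this proof can be watched in action on a toy problem. The Python sketch below runs a disagreement-based learner, in the spirit of the noise-free algorithm analyzed in Theorem 1, on one-dimensional thresholds with a uniform pool. Everything specific here is an assumption made for illustration: the hypothesis class $h_t(x) = 1[x \ge t]$, the uniform marginal on [0, 1], the target threshold, and the names `cal_thresholds` and `oracle`.

```python
import random


def cal_thresholds(pool, oracle):
    """Disagreement-based active learning for thresholds h_t(x) = 1[x >= t].

    The version space is the set of thresholds t in (lo, hi] consistent with
    every label requested so far. A pool point x lies in the region of
    disagreement exactly when lo < x < hi.
    """
    lo, hi = 0.0, 1.0
    requests = 0
    for x in pool:
        if lo < x < hi:          # consistent classifiers disagree on x
            requests += 1
            if oracle(x) == 1:   # label 1: the target threshold is <= x
                hi = x
            else:                # label 0: the target threshold is > x
                lo = x
        # otherwise all consistent classifiers agree on x, so no label is needed
    return lo, hi, requests


rng = random.Random(0)
target = 0.37                     # hypothetical noise-free target threshold
pool = [rng.random() for _ in range(10_000)]
lo, hi, requests = cal_thresholds(pool, lambda x: int(x >= target))
# Under the uniform marginal, any threshold in (lo, hi] has error at most hi - lo.
```

Under these assumptions the width hi - lo plays the role of $\mathbb{P}(\mathrm{DIS}(V_t))$, and the number of label requests grows roughly logarithmically in the pool size, in line with the exponential rate of Theorem 1.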

4.2. Known results on convergence rates for agnostic active learning. We 
will now describe the known results for agnostic active learning algorithms, 
starting with Algorithm 1. The key to the potential convergence rate im- 
provements of Algorithm 1 is that, as the region of disagreement R decreases 
in measure, the error difference $\mathrm{er}(h|R) - \mathrm{er}(h'|R)$ of any classifiers $h, h' \in V$
under the conditional sampling distribution (given R) can become significantly
larger [by a factor of $\mathbb{P}(R)^{-1}$] than $\mathrm{er}(h) - \mathrm{er}(h')$, making it significantly
easier to determine which of the two is worse using a sample of labeled
data. In particular, [19] developed a technique for analyzing this type of al- 
gorithm, and adapting that analysis to the above definition of Algorithm 1 
results in the following guarantee. 
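
To make the benefit of the conditional gap concrete, here is a small Python sketch. It is not the analysis of [19]; it simply applies Hoeffding's inequality to a [-1, 1]-valued error difference to estimate how many labels are needed to resolve a gap of a given size. The function name `labels_to_separate` and the numbers used are hypothetical.

```python
import math


def labels_to_separate(gap: float, delta: float) -> int:
    """Hoeffding sample size for a [-1, 1]-valued error difference.

    With this many i.i.d. labeled points, the empirical mean of
    1[h wrong] - 1[h' wrong] deviates from its expectation by less than
    `gap` with probability at least 1 - delta.
    """
    return math.ceil(2.0 * math.log(2.0 / delta) / gap ** 2)


global_gap = 0.01      # er(h) - er(h') under the full distribution
p_region = 0.1         # probability mass of the region of disagreement R
conditional_gap = global_gap / p_region   # er(h|R) - er(h'|R)

n_global = labels_to_separate(global_gap, delta=0.05)
n_conditional = labels_to_separate(conditional_gap, delta=0.05)
```

In this simplified accounting the label requirement drops by a factor of about $\mathbb{P}(R)^{2}$ when sampling from the conditional distribution given R, which is the effect the paragraph above describes.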

Theorem 2 [19]. Let $h_n$ be the classifier returned by Algorithm 1 when
allowed n label requests, using the bounds (1) and confidence parameter
$\delta \in (0,1/2)$. Then there exists a finite universal constant c such that, with
probability $\ge 1-\delta$, $\forall n \in \mathbb{N}$,

$\mathrm{er}(h_n) - \nu \le c\sqrt{\frac{\nu^2\theta^2(d\log n + \log(1/\delta))\log\frac{n+2\nu\theta}{\nu\theta}}{n}} + 2\exp\left\{-\frac{n}{c\theta^2(d\log\theta + \log(n/\delta))}\right\}$.



Similarly, the key to improvements from Algorithm 2 is that as the number 
m of processed unlabeled data points increases, we only need to request 
the labels of those data points in the region of disagreement of the set 
of classifiers with near-optimal empirical error rates. Thus, if the region 
of disagreement of classifiers with excess error $\le \epsilon$ shrinks as $\epsilon$ shrinks,
we expect the frequency of label requests to shrink as m increases. Since
we are careful not to discard the best classifier, and the excess error rate
of a classifier can be bounded in terms of the $\Delta_m$ function, we end up
with a bound on the excess error which is converging in m, the number of
unlabeled data points processed, even though we request a number of labels 
growing slower than m. When this situation occurs, we expect Algorithm 
2 will provide an improved convergence rate compared to passive learning. 
Dasgupta, Hsu and Monteleoni [14] prove the following convergence rate 
guarantee. 

Theorem 3 [14]. Let $h_n$ be the classifier returned by Algorithm 2 when
allowed n label requests, using the threshold (3), and confidence parameter
$\delta \in (0,1/2)$. Then there exists a finite universal constant c such that, with
probability $\ge 1-\delta$, $\forall n \in \mathbb{N}$,



Note that, among other changes, this bound improves the dependence 
on the disagreement coefficient $\theta$, compared to the bound for Algorithm 1.
In both cases, for certain ranges of $\theta$, $\nu$ and n, these bounds can represent
significant improvements in the excess error guarantees, compared to the
corresponding guarantees possible for passive learning. However, in both cases,
when $\nu > 0$ these bounds have an asymptotic dependence on n of $\Theta(n^{-1/2})$,
which is no better than the convergence rates achievable by passive learning 
(e.g., by empirical risk minimization). Thus, there remains the question of 
whether either algorithm can achieve asymptotic convergence rates strictly 
superior to passive learning for distributions with nonzero noise rates. This 
is the topic we turn to next. 

4.3. Active learning under Tsybakov's noise conditions. It is known that
for most nontrivial $\mathbb{C}$, for any n and $\nu > 0$, for every active learning algorithm
there is some distribution with noise rate $\nu$ for which we can guarantee
excess error no better than $\propto \nu n^{-1/2}$ [21]; that is, the $n^{-1/2}$ asymptotic
dependence on n in the above bounds matches the corresponding minimax
rate, and thus cannot be improved as long as the bounds depend on $\mathcal{D}_{XY}$
only via $\nu$ (and $\theta$). Therefore, if we hope to discover situations in which
these algorithms have strictly superior asymptotic dependence on n, we will 
need to allow the bounds to depend on a more detailed description of the 
noise distribution than simply the noise rate v. 

As previously mentioned, one way to describe a noise distribution us- 
ing a more detailed parametrization is to use Tsybakov's noise conditions 
(Condition 1). In the context of passive learning, this allows one to describe
situations in which the rate of convergence is between $n^{-1}$ and $n^{-1/2}$, even
when $\nu > 0$. This raises the natural question of how these active learning
algorithms perform when the noise distribution satisfies this condition with
finite $\mu$ and $\kappa$ parameter values. In many ways, it seems active learning
is particularly well-suited to exploit these more favorable noise conditions,
since they imply that as we eliminate suboptimal classifiers, the diameter
of the remaining set shrinks; thus, for finite $\theta$ values, the region of
disagreement should also be shrinking, allowing us to focus the samples in a smaller
region and accelerate the convergence. 

Focusing on the special case of learning one-dimensional threshold clas- 
sifiers under a certain uniform marginal distribution, Castro and Nowak 
[12] studied conditions related to Condition 1. In particular, they studied 
a threshold-learning algorithm that, unlike the algorithms described here, 
takes $\kappa$ as input, and found its convergence rate to be $\propto \left(\frac{\log n}{n}\right)^{\kappa/(2\kappa-2)}$ when
$\kappa > 1$, and $\exp\{-cn\}$ for some ($\mu$-dependent) constant c, when $\kappa = 1$. Note
that this improves over the $n^{-\kappa/(2\kappa-1)}$ rates achievable in passive learning
[12, 28]. Subsequently, Balcan, Broder and Zhang [7] proved an analogous
positive result for higher-dimensional linear separators, and Castro
and Nowak [12] additionally showed a related result for boundary fragment 
classes (see below); in both cases, the algorithm depends explicitly on the 
noise parameters. Later, in a personal communication, Langford and Castro 
claimed that in fact Algorithm 1 achieves this rate (up to log factors) for 
the one-dimensional thresholds problem, leading to speculation that perhaps 
these improvements are achievable in the general case as well (under con- 
ditions on the disagreement coefficient). Castro and Nowak [12] also prove 
that a value $\propto n^{-\kappa/(2\kappa-2)}$ (or $\exp\{-c'n\}$, for some $c'$, when $\kappa = 1$) is also
a lower bound on the minimax rate for the threshold learning problem. In 
fact, a similar proof to theirs can be used to show this same lower bound 
holds for any nontrivial C. For completeness, a proof of this more general 
result is included in Appendix D of the supplementary material [20]. 

Other than the few specific results mentioned above, it was not previously 
known whether Algorithm 1 or Algorithm 2, or indeed any active learning 
algorithm, generally achieves convergence rates that exhibit these types of 
improvements. 

4.4. Adaptive rates in active learning: Algorithm 1. The above observa- 
tions open the question of whether these algorithms, or variants thereof, 
improve this asymptotic dependence on n. It turns out this is indeed possi- 
ble. Specifically, we have the following result for Algorithm 1. 

Theorem 4. Let $h_n$ be the classifier returned by Algorithm 1 when
allowed n label requests, using the bounds (1) and confidence parameter
$\delta \in (0,1/2)$. Suppose further that $\mathcal{D}_{XY}$ satisfies Condition 1. Then there
exists a finite ($\kappa$- and $\mu$-dependent) constant c such that, for any $n \in \mathbb{N}$,
with probability $\ge 1-\delta$,

$\mathrm{er}(h_n) - \nu \le \begin{cases} \exp\left\{-\frac{n}{c\theta^2(d\log n + \log(1/\delta))}\right\}, & \text{when } \kappa = 1, \\ c\left(\frac{\theta^2(d\log n + \log(1/\delta))\log n}{n}\right)^{\kappa/(2\kappa-2)}, & \text{when } \kappa > 1. \end{cases}$



Proof. We will proceed by bounding the label complexity, or size of the 
label budget n that is sufficient to guarantee, with high probability, that the 
excess error of the returned classifier will be at most $\epsilon$ (for arbitrary $\epsilon > 0$);
with this in hand, we can simply bound the inverse of the function to get 
the result in terms of a bound on excess error. 

Throughout this proof (and proofs of later results in this paper), we will 
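make frequent use of basic facts about $\mathrm{er}(h|R)$. In particular, for any
classifiers $h, h'$ and set $R \subseteq \mathcal{X}$, we have $\mathrm{er}(h) = \mathrm{er}(h|R)\mathbb{P}(R) + \mathrm{er}(h|\mathcal{X}\setminus R)\mathbb{P}(\mathcal{X}\setminus R)$;
also, if $\{x : h(x) \ne h'(x)\} \subseteq R$, we have $\mathrm{er}(h|\mathcal{X}\setminus R) - \mathrm{er}(h'|\mathcal{X}\setminus R) = 0$
and therefore $\mathrm{er}(h) - \mathrm{er}(h') = (\mathrm{er}(h|R) - \mathrm{er}(h'|R))\mathbb{P}(R)$.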
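
The last identity is easy to confirm with exact arithmetic on a small example. The Python sketch below uses a made-up joint distribution on four points and two threshold classifiers that disagree only inside R; all names and numbers are hypothetical and serve only to illustrate the identity.

```python
from fractions import Fraction as F

# Hypothetical joint distribution over (x, y) pairs; probabilities sum to 1.
DIST = {
    (0, 0): F(1, 5), (0, 1): F(1, 20),
    (1, 0): F(1, 10), (1, 1): F(3, 20),
    (2, 0): F(1, 20), (2, 1): F(1, 5),
    (3, 1): F(1, 4),
}


def prob(region):
    """Probability that X falls in the given set of points."""
    return sum(p for (x, _), p in DIST.items() if x in region)


def er(h, region=None):
    """Error rate of h, conditioned on X in region (the whole space if None)."""
    region = {x for x, _ in DIST} if region is None else region
    wrong = sum(p for (x, y), p in DIST.items() if x in region and h(x) != y)
    return wrong / prob(region)


h = lambda x: int(x >= 1)
h_prime = lambda x: int(x >= 3)
R = {1, 2}                      # the two classifiers disagree exactly on R

lhs = er(h) - er(h_prime)
rhs = (er(h, R) - er(h_prime, R)) * prob(R)
```

Here both sides equal -1/5: the classifiers make identical mistakes outside R, so the whole difference in error is carried by R and is rescaled by $\mathbb{P}(R) = 1/2$.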

Note that, by Lemma 4 and a union bound, on an event of probability
$1-\delta$, (2) holds with $\delta' = \delta/n$ for every set Q, relative to the conditional
distribution given its respective R set, for any value of n. For the remainder
of this proof, we assume that this $1-\delta$ probability event occurs. In particular,
this means that for every $h \in \mathbb{C}$ and every Q set in the algorithm,
$LB(h,Q,\delta/n) \le \mathrm{er}(h|R) \le UB(h,Q,\delta/n)$, for the set R that Q is sampled
under.

Our first task is to show that we never remove the "good" classifiers from
V. We only remove a classifier h from V if $h' = \arg\min_{h'\in V} UB(h',Q,\delta/n)$
has $LB(h,Q,\delta/n) > UB(h',Q,\delta/n)$. Each $h \in V$ has $\{x : h(x) \ne h'(x)\} \subseteq
\mathrm{DIS}(V) \subseteq R$, so that

$UB(h',Q,\delta/n) - LB(h,Q,\delta/n) \ge \mathrm{er}(h'|R) - \mathrm{er}(h|R) = \frac{\mathrm{er}(h') - \mathrm{er}(h)}{\mathbb{P}(R)}$.

Thus, for any $h \in V$ with $\mathrm{er}(h) \le \mathrm{er}(h')$, $UB(h',Q,\delta/n) - LB(h,Q,\delta/n) \ge
\mathrm{er}(h'|R) - \mathrm{er}(h|R) = (\mathrm{er}(h') - \mathrm{er}(h))/\mathbb{P}(R) \ge 0$, so that on any given round
of the algorithm, the set $\{h \in V : \mathrm{er}(h) \le \mathrm{er}(h')\}$ is not removed from V. In
particular, since we always have $\mathrm{er}(h') \ge \nu$, by induction this implies the
invariant $\inf_{h\in V}\mathrm{er}(h) = \nu$, and therefore also

$\forall t,\quad \mathrm{er}(h_t) - \nu = \mathrm{er}(h_t) - \inf_{h\in V}\mathrm{er}(h) = \left(\mathrm{er}(h_t|R) - \inf_{h\in V}\mathrm{er}(h|R)\right)\mathbb{P}(R) \le \beta_t$,

where again the second equality is due to the fact that $\forall h \in V$, $\{x : h_t(x) \ne
h(x)\} \subseteq \mathrm{DIS}(V) \subseteq R$. We will spend the remainder of the proof bounding
the size of n sufficient to guarantee some $\beta_t \le \epsilon$. In particular, similar
to the proof of Theorem 1, we will see that as long as $\beta_t > \epsilon$, we
will halve $\mathbb{P}(\mathrm{DIS}(V))$ roughly every $O(\theta^2 d\epsilon^{2/\kappa-2})$ label requests, so that
the total number of label requests before some $\beta_t \le \epsilon$ is at most roughly
$O(\theta^2 d\epsilon^{2/\kappa-2}\log(1/\epsilon))$.
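Recalling the definition of $h^{[k]}$ (from Definition 2), let

(4) $V^{(\theta)} = \left\{h \in V : \limsup_{k\to\infty}\mathbb{P}(h(X) \ne h^{[k]}(X)) > \frac{\mathbb{P}(R)}{2\theta}\right\}$.

Note that after step 7, if $V^{(\theta)} = \emptyset$, then

$\mathbb{P}(\mathrm{DIS}(V)) \le \mathbb{P}\left(\mathrm{DIS}\left(\left\{h \in \mathbb{C} : \limsup_{k\to\infty}\mathbb{P}(h(X) \ne h^{[k]}(X)) \le \mathbb{P}(R)/(2\theta)\right\}\right)\right)$

$= \lim_{k'\to\infty}\mathbb{P}\left(\mathrm{DIS}\left(\bigcap_{k>k'} B(h^{[k]}, \mathbb{P}(R)/(2\theta))\right)\right)$

$\le \lim_{k'\to\infty}\mathbb{P}\left(\bigcap_{k>k'}\mathrm{DIS}(B(h^{[k]}, \mathbb{P}(R)/(2\theta)))\right)$

$\le \liminf_{k\to\infty}\mathbb{P}(\mathrm{DIS}(B(h^{[k]}, \mathbb{P}(R)/(2\theta))))$

$\le \theta\,\frac{\mathbb{P}(R)}{2\theta} = \frac{\mathbb{P}(R)}{2}$,

so that we will satisfy the condition in step 2 on the next round. Here we
have used the definition of $\theta$ in the final inequality and equality. On the
other hand, if after step 7, we have $V^{(\theta)} \ne \emptyset$, then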

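$\emptyset \ne \left\{h \in V : \limsup_{k\to\infty}\mathbb{P}(h(X) \ne h^{[k]}(X)) > \frac{\mathbb{P}(R)}{2\theta}\right\}$

$= \left\{h \in V : \left(\frac{\limsup_{k\to\infty}\mathbb{P}(h(X) \ne h^{[k]}(X))}{\mu}\right)^{\kappa} > \left(\frac{\mathbb{P}(R)}{2\mu\theta}\right)^{\kappa}\right\}$

$\subseteq \left\{h \in V : \left(\frac{\mathrm{diam}(\mathrm{er}(h)-\nu;\mathbb{C})}{\mu}\right)^{\kappa} > \left(\frac{\mathbb{P}(R)}{2\mu\theta}\right)^{\kappa}\right\}$

$\subseteq \left\{h \in V : \mathrm{er}(h) - \nu > \left(\frac{\mathbb{P}(R)}{2\mu\theta}\right)^{\kappa}\right\}$

$= \left\{h \in V : \mathrm{er}(h|R) - \inf_{h'\in V}\mathrm{er}(h'|R) > \mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa}\right\}$

$\subseteq \left\{h \in V : UB(h,Q,\delta/n) - \min_{h'\in V} LB(h',Q,\delta/n) > \mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa}\right\}$

$\subseteq \left\{h \in V : LB(h,Q,\delta/n) - \min_{h'\in V} UB(h',Q,\delta/n) > \mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa} - 4G(|Q|,\delta/n)\right\}$.

Here, the third line follows from the fact that $\mathrm{er}(h^{[k]}) \le \mathrm{er}(h)$ for all
sufficiently large k, the fourth line follows from Condition 1, and the final line
follows from the definition of UB and LB. By definition, every $h \in V$ has
$LB(h,Q,\delta/n) \le \min_{h'\in V} UB(h',Q,\delta/n)$, so for this last set to be nonempty
after step 7, we must have $\mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa} < 4G(|Q|,\delta/n)$.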
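Combining these two cases ($V^{(\theta)} = \emptyset$ and $V^{(\theta)} \ne \emptyset$), since $|Q|$ gets reset
to 0 upon reaching step 3, we have that after every execution of step 7,

(5) $\mathbb{P}(R)^{\kappa-1}(2\mu\theta)^{-\kappa} < 4G(|Q|-1,\delta/n)$.

If $\mathbb{P}(R) \le \frac{\epsilon}{2G(|Q|-1,\delta/n)} \le \frac{\epsilon}{2G(|Q|,\delta/n)}$, then certainly $\beta_t \le \epsilon$ (by definition of
$\beta_t \le 2G(|Q|,\delta/n)\mathbb{P}(R)$). So on any round for which $\beta_t > \epsilon$, we must have

(6) $\frac{\epsilon}{2G(|Q|-1,\delta/n)} < \mathbb{P}(R)$.

Combining (5) and (6), on any round for which $\beta_t > \epsilon$,

(7) $\left(\frac{\epsilon}{2G(|Q|-1,\delta/n)}\right)^{\kappa-1}(2\mu\theta)^{-\kappa} < 4G(|Q|-1,\delta/n)$.

Solving for $G(|Q|-1,\delta/n)$ reveals that when $\beta_t > \epsilon$,

(8) $4^{-1/\kappa}\left(\frac{\epsilon}{2}\right)^{(\kappa-1)/\kappa}(2\mu\theta)^{-1} < G(|Q|-1,\delta/n)$.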

Basic algebra shows that when $n \ge |Q| > d$, we have

$G(|Q|-1,\delta/n) \le 3\sqrt{\frac{\ln(4/\delta) + (d+1)\ln(n)}{|Q|}}$.

Combining this with (8), solving for $|Q|$ and adding d to handle the case
$|Q| \le d$, we have that on any round for which $\beta_t > \epsilon$,

(9) $|Q| \le \left(\frac{2}{\epsilon}\right)^{(2\kappa-2)/\kappa}(6\mu\theta)^2 4^{2/\kappa}\left(\ln\frac{4}{\delta} + (d+1)\ln(n)\right) + d$.

Since $\beta_t \le \mathbb{P}(R)$ by definition, and $\mathbb{P}(R)$ is at least halved each time we
reach step 3, we need to reach step 3 at most $\lceil\log_2(1/\epsilon)\rceil$ times before we
are guaranteed some $\beta_t \le \epsilon$. Thus, any

(10) $n \ge 1 + \left(\left(\frac{2}{\epsilon}\right)^{(2\kappa-2)/\kappa}(6\mu\theta)^2 4^{2/\kappa}\left(\ln\frac{4}{\delta} + (d+1)\ln(n)\right) + d\right)\log_2\frac{2}{\epsilon}$

suffices to guarantee either some $|Q|$ exceeds (9) or we reach step 3 at least
$\lceil\log_2(1/\epsilon)\rceil$ times, either of which implies the existence of some $\beta_t \le \epsilon$. The
stated result now follows by basic inequalities to bound the smallest value
of $\epsilon$ satisfying (10) for a given value of n. □

If the disagreement coefficient is finite, Theorem 4 can often represent a 
significant improvement in convergence rate compared to passive learning, 
where we typically expect rates of order $n^{-\kappa/(2\kappa-1)}$ [12, 26, 28]; this gap is
especially notable when the disagreement coefficient and $\kappa$ are small.
Furthermore, the bound matches (up to logarithmic factors) the form of the
minimax rate lower bound proved by Castro and Nowak [12] for threshold
classifiers (where $\theta = 2$); as mentioned, that lower bound proof can be
generalized to any nontrivial $\mathbb{C}$ (see Appendix D of the supplementary material
[20]), so that the rate of Theorem 4 is nearly minimax optimal for any
nontrivial $\mathbb{C}$ with bounded disagreement coefficients. Also note that, unlike the
upper bound analysis of Castro and Nowak [12], we do not require the al- 
gorithm to be given any extra information about the noise distribution, so 
that this result is somewhat stronger; it is also more general, as this bound 
applies to an arbitrary hypothesis class. 
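
For a quick sense of the gap between the two regimes, the following Python sketch compares the polynomial exponents quoted in this section: $n^{-\kappa/(2\kappa-1)}$ for passive learning and $n^{-\kappa/(2\kappa-2)}$ for Theorem 4. It deliberately ignores logarithmic factors, the disagreement coefficient, and constants, and the function names are hypothetical.

```python
import math


def passive_exponent(kappa: float) -> float:
    """Exponent a in the passive rate n^(-a) = n^(-kappa / (2*kappa - 1))."""
    return kappa / (2.0 * kappa - 1.0)


def active_exponent(kappa: float) -> float:
    """Exponent a in the active rate n^(-a) = n^(-kappa / (2*kappa - 2)).

    For kappa = 1 the active rate is exponential in n, which is reported
    here as an infinite polynomial exponent.
    """
    if kappa == 1.0:
        return math.inf
    return kappa / (2.0 * kappa - 2.0)


for kappa in (1.0, 1.5, 2.0, 5.0):
    print(kappa, passive_exponent(kappa), active_exponent(kappa))
```

The difference between the two exponents is $\kappa/((2\kappa-1)(2\kappa-2))$, which is largest for $\kappa$ near 1 and vanishes as $\kappa \to \infty$, where both rates approach $n^{-1/2}$.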

A refined analysis and minor tweaks to the algorithm should be able to
reduce the log factors in this result. For instance, defining UB and LB
using the uniform convergence bounds of Alexander [1], and using a slightly
more complicated algorithm closer to the original definition [6, 19] (taking
multiple samples between bound evaluations, allowing a larger confidence
argument to the UB and LB evaluations), the $\log^2 n$ factor should reduce
at least to $\log n\log\log n$, if not further. Also, as previously mentioned, it is
possible to replace the quantities $\mathbb{P}(R)$ and $\mathbb{P}(\mathrm{DIS}(V))$ in Algorithm 1 with
estimators of these quantities based on a finite sample of unlabeled data
points, while preserving the results of Theorem 4 up to constant factors. We
include an example of such estimators in Appendix C of the supplementary
material [20], along with a sketch of how to modify the proof of Theorem 4
to compensate for using these estimated probabilities.

4.5. Adaptive rates in active learning: Algorithm 2. Note that, as before,
n gets divided by $\theta^2$ in the rates achieved by Algorithm 1. As before, it is
not clear whether any modification to the definitions of UB and LB can
reduce this exponent on $\theta$ from 2 to 1. As such, it is natural to investigate
the rates achieved by Algorithm 2 under Condition 1; we know that it does
improve the dependence on $\theta$ for the worst case rates over distributions with
any given noise rate, so we might hope that it does the same for the rates
over distributions with any given values of $\mu$ and $\kappa$. Unfortunately, we do
not presently know whether the original definition of Algorithm 2 achieves
this improvement. However, we now present a slight modification of the
algorithm, and prove that it does indeed provide the desired improvement
in dependence on $\theta$, while maintaining the improvements in the asymptotic
dependence on n. Specifically, consider the following definition for the
threshold in Algorithm 2:

(11) $\Delta_m(\mathcal{L}, Q, h^{(y)}, h^{(-y)}, \delta) = 3\hat{\mathcal{E}}_{\mathbb{C}}(\mathcal{L}\cup Q, \delta; \mathcal{L})$,

where $\hat{\mathcal{E}}_{\mathbb{C}}(\cdot,\cdot;\cdot)$ is defined in the Appendix, based on a notion of local
Rademacher complexity studied by Koltchinskii [23]. In particular, the
quantity $\hat{\mathcal{E}}_{\mathbb{C}}$ is known to be adaptive to Tsybakov's noise conditions, in the sense
that more favorable noise conditions yield smaller values of $\hat{\mathcal{E}}_{\mathbb{C}}$. Using this
definition, we have the following theorem; due to space limitations, its proof
is not presented here, but is included in Appendix B of the supplementary
material [20].

Theorem 5. Suppose $h_n$ is the classifier returned by Algorithm 2 with
threshold as in (11), when allowed n label requests and given confidence
parameter $\delta \in (0,1/2)$. Suppose further that $\mathcal{D}_{XY}$ satisfies Condition 1 with
finite parameter values $\kappa$ and $\mu$. Then there exists a finite ($\kappa$- and $\mu$-dependent)
constant c such that, with probability $\ge 1-\delta$, $\forall n \in \mathbb{N}$,

when $\kappa > 1$.

Note that this does indeed improve the dependence on $\theta$, reducing its
exponent from 2 to 1; we do lose some in that there is now a square root
in the exponent of the $\kappa = 1$ case; however, as with Theorem 4, it is likely
that slight refinements to the definition of $\Delta_m$ would reduce this (though
we may also need to weaken the theorem statement to hold for any single
n, rather than simultaneously for all n).

The bound in Theorem 5 is stated in terms of the VC dimension d. However,
for certain nonparametric hypothesis classes, it is sometimes preferable
to quantify the complexity of the class in terms of a constraint on the
entropy of the class, relative to the distribution $\mathcal{D}_{XY}$ (see e.g., [12, 23, 28, 29]).
Specifically, for $\epsilon \in [0,1]$, define

$\omega_{\mathbb{C}}(m,\epsilon) = \mathbb{E}\sup_{h_1,h_2\in\mathbb{C}:\ \mathbb{P}\{h_1(X)\ne h_2(X)\}\le\epsilon} |(\mathrm{er}(h_1) - \mathrm{er}_m(h_1)) - (\mathrm{er}(h_2) - \mathrm{er}_m(h_2))|$.

Condition 2. There exist finite constants $\alpha > 0$ and $\rho \in (0,1)$ s.t. $\forall m \in
\mathbb{N}$ and $\epsilon \in [0,1]$, $\omega_{\mathbb{C}}(m,\epsilon) \le \alpha\cdot\max\{\epsilon^{(1-\rho)/2}m^{-1/2}, m^{-1/(1+\rho)}\}$.
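
As a side calculation (not part of the original argument), it can help to see where the two terms inside the maximum trade places. Reading the condition as $\omega_{\mathbb{C}}(m,\epsilon) \le \alpha\cdot\max\{\epsilon^{(1-\rho)/2}m^{-1/2}, m^{-1/(1+\rho)}\}$, equating the two terms and solving for $\epsilon$ gives:

```latex
\epsilon^{(1-\rho)/2}\, m^{-1/2} = m^{-1/(1+\rho)}
\iff
\epsilon^{(1-\rho)/2} = m^{\frac{1}{2}-\frac{1}{1+\rho}} = m^{-\frac{1-\rho}{2(1+\rho)}}
\iff
\epsilon = m^{-1/(1+\rho)}.
```

So under this reading the first term governs the bound for $\epsilon \ge m^{-1/(1+\rho)}$, while for smaller $\epsilon$ the bound is flat at $\alpha m^{-1/(1+\rho)}$.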

In particular, the entropy with bracketing condition used in the original 
minimax analysis of Tsybakov [28] implies Condition 2 [23], as does the 
analogous condition for random entropy [17, 18, 24]. In passive learning, it is 
known that empirical risk minimization achieves a rate of order $n^{-\kappa/(2\kappa+\rho-1)}$
under Conditions 1 and 2 [23, 24] (see also Appendix B of the supplementary 
material [20], especially (19) and Lemma 5), and that this is sometimes 
minimax optimal [28]. The following theorem gives a bound on the rate of 
convergence of the same version of Algorithm 2 as in Theorem 5, this time 
in terms of the entropy condition which, as before, is faster than the passive 
learning rate when the disagreement coefficient is finite. The proof of this 
result is included in Appendix B of the supplementary material [20]. 

Theorem 6. Suppose $h_n$ is the classifier returned by Algorithm 2 with
threshold as in (11), when allowed n label requests and given confidence
parameter $\delta \in (0,1/2)$. Suppose further that $\mathcal{D}_{XY}$ satisfies Condition 1 with
finite parameter values $\kappa$ and $\mu$, and Condition 2 with parameter values $\alpha$
and $\rho$. Then there exists a finite ($\kappa$, $\mu$, $\alpha$ and $\rho$-dependent) constant c such
that, with probability $\ge 1-\delta$, $\forall n \in \mathbb{N}$,

er{hn)-u<ci ^^-^j 

Again, it is likely that refinements to the definition may lead to 
improvements in the log factor. Also, although this result is stated for Algo- 
rithm 2, it is conceivable that, by modifying Algorithm 1 to use definitions 
of V and $\beta_t$ based on $\hat{\mathcal{E}}_{\mathbb{C}}(Q,\delta;\emptyset)$, an analogous result might be possible for
Algorithm 1 as well. 

It is worth mentioning that Castro and Nowak [12] proved a minimax 
lower bound for the hypothesis class of boundary fragments, with an exponent
having a similar dependence on related definitions of $\kappa$ and $\rho$ parameters
to that of Theorem 6. Their result does provide a valid lower bound
here; however, it is not clear whether their lower bound, Theorem 6, both,
or neither is tight in the present context, since the value of $\theta$ is not presently
known for that particular problem, and the matching upper bound of [12]
was proven under a stronger restriction on the noise than Condition 1.
However, see [33] for an analysis of the disagreement coefficient for other
nonparametric hypothesis classes, characterized by smoothness of the decision
surface.
5. Model selection. While the previous sections address adaptation to
the noise distribution, they are still restrictive in that they deal with
hypothesis classes of limited expressiveness. That is, the assumption of finite
VC dimension implies a strong restriction on the variety of classifiers one
can represent (or approximate) in the class; the entropy conditions allow
slightly more flexibility, but under nontrivial distributions, even the entropy
conditions imply a significant restriction on the expressiveness of the class. 
Thus, for algorithms restricted to classifiers from such a restricted hypothe- 
sis class, it is often unrealistic to expect convergence to the Bayes error rate. 
We address this issue in this section by developing a general algorithm for 
learning with a sequence of nested hypothesis classes of increasing complex- 
ity, similar to the setting of Structural Risk Minimization in passive learning 
[30]. The objective is to adapt, not only to the noise conditions, but also 
to the complexity of the optimal classifier. The starting point for this dis- 
cussion is the assumption of a structure on C, in the form of a sequence of 
nested hypothesis classes: 

$\mathbb{C}_1 \subset \mathbb{C}_2 \subset \cdots$.

Each class $\mathbb{C}_i$ has an associated noise rate $\nu_i = \inf_{h\in\mathbb{C}_i}\mathrm{er}(h)$, and we define
$\nu_\infty = \lim_{i\to\infty}\nu_i$. We also let $\theta_i$ and $d_i$ be the disagreement coefficient and
VC dimension, respectively, for the set $\mathbb{C}_i$. We are interested in an algorithm
that guarantees convergence in probability of the error rate to $\nu_\infty$. We are
particularly interested in situations where $\nu_\infty = \nu^*$, a condition which is
realistic in this setting since the sets $\mathbb{C}_i$ can be defined so that it is always
satisfied, even while maintaining each $d_i < \infty$ (see, e.g., [15]). Additionally, if
we are so lucky as to have some $\nu_i = \nu^*$, then we would like the convergence
rate achieved by the algorithm to be not significantly worse than running
one of the above agnostic active learning algorithms with hypothesis class
$\mathbb{C}_i$ alone. In this context, we can define a structure-dependent version of
Tsybakov's noise condition as follows. 

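Condition 3. For some nonempty $I \subseteq \mathbb{N}$, for each $i \in I$, there exist
finite constants $\mu_i > 0$ and $\kappa_i \ge 1$, such that $\forall\epsilon > 0$, $\mathrm{diam}(\epsilon;\mathbb{C}_i) \le \mu_i\epsilon^{1/\kappa_i}$.

Note that we do not require every $\mathbb{C}_i$, $i \in \mathbb{N}$, to have finite $\mu_i$ and $\kappa_i$, only
some nonempty set $I \subseteq \mathbb{N}$; this is important, since we might not expect $\mathbb{C}_i$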
to satisfy Condition 1 for small indices i, where the expressiveness is quite 
restricted. 

In passive learning, there are several methods for this type of model
selection which are known to preserve the convergence rates of each class $\mathbb{C}_i$
under Condition 3 (e.g., [23, 28]). In particular, Koltchinskii [23] develops a
method that performs this type of model selection; it turns out we can modify
Koltchinskii's method to suit our present needs in the context of active
learning; this results in a general active learning model selection method
that preserves the types of improved rates discussed in the previous section.
This modification is presented below, based on using Algorithm 2 as a
subroutine. (It should also be possible to define an analogous method that uses
Algorithm 1 as a subroutine instead.)



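Algorithm 3

Input: nested sequence of classes $\{\mathbb{C}_i\}$, label budget n, confidence parameter $\delta$
Output: classifier $h_n$

0. For $i = \lfloor\sqrt{n/2}\rfloor, \lfloor\sqrt{n/2}\rfloor - 1, \lfloor\sqrt{n/2}\rfloor - 2, \ldots, 1$
1.    Let $\mathcal{L}_{in}$ and $Q_{in}$ be the sets returned by Algorithm 2 run with $\mathbb{C}_i$ and the
      threshold (11), allowing $\lfloor n/(2i^2)\rfloor$ label requests, and confidence
2.    Let $h_{in} \leftarrow \mathrm{LEARN}_{\mathbb{C}_i}\left(\bigcup_{j\ge i}\mathcal{L}_{jn}, Q_{in}\right)$
3.    If $h_{in} \ne \emptyset$ and $\forall j$ s.t. $i < j \le \lfloor\sqrt{n/2}\rfloor$,
4.       $h_n \leftarrow h_{in}$
5. Return $h_n$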
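
One detail of Algorithm 3 that is easy to check numerically is that the per-class label budgets never add up to more than n. The Python sketch below assumes, following our reading of steps 0 and 1, that the index i ranges over $1, \ldots, \lfloor\sqrt{n/2}\rfloor$ and that the run for class i is allowed $\lfloor n/(2i^2)\rfloor$ label requests; the function name `round_budgets` is hypothetical.

```python
import math


def round_budgets(n: int) -> dict:
    """Label budget for each class index i, listed from the largest index down.

    Assumes indices 1..floor(sqrt(n/2)) and a budget of floor(n / (2*i^2))
    for index i, which is how the loop bounds are read here.
    """
    top = math.isqrt(n // 2)          # equals floor(sqrt(n/2))
    return {i: n // (2 * i * i) for i in range(top, 0, -1)}


budgets = round_budgets(1_000)
total = sum(budgets.values())
# Since sum_i 1/(2 i^2) <= pi^2 / 12 < 1, the total stays below n.
```

The upper limit on i also guarantees that every class that is run receives at least one label request, because $i \le \sqrt{n/2}$ implies $2i^2 \le n$.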



The function $\hat{\mathcal{E}}_{\cdot}(\cdot,\cdot;\cdot)$ is defined in the Appendix. This method can be
shown to have a confidence bound on its error rate converging to $\nu_\infty$ at a
rate never significantly worse than the original passive learning method of 
Koltchinskii [23], as desired. Additionally, we have the following guarantee 
on the rate of convergence under Condition 3. The proof is similar in style to 
Koltchinskii's original proof, though some care is needed due to the altered 
sampling distribution and the constraint set $\mathcal{L}_{jn}$. The proof is included in
Appendix B of the supplementary material [20]. 

Theorem 7. Suppose $h_n$ is the classifier returned by Algorithm 3, when
allowed n label requests and confidence parameter $\delta \in (0,1/2)$. Suppose
further that $\mathcal{D}_{XY}$ satisfies Condition 3. Then there exist finite ($\kappa_i$ and
$\mu_i$-dependent) constants $c_i$ such that, with probability $\ge 1-\delta$, $\forall n \in \mathbb{N}$,

er{hn) - h'oo 

< 3min(fj - z^oo) 



+ < 



^ V ''A- ^cMd^+lgil/6)) ' 'f^^ 



ei{d,logn + log{l/6)) \ 
n j 



Ki/{2ni-2) 



if Hi > 1. 
In particular, if we are so lucky as to have $\nu_i = \nu^*$ for some finite i, then the
above algorithm achieves a convergence rate not significantly worse than that
guaranteed by Theorem 5 for applying Algorithm 2 directly, with hypothesis
class $\mathbb{C}_i$. Note that the algorithm itself has no dependence on the set I, nor
has it any dependence on each class's complexity parameters $d_i, \kappa_i, \mu_i, \theta_i$; the
adaptive behavior of the data-dependent bound $\hat{\mathcal{E}}_{\mathbb{C}_j}$ allows the algorithm to
adaptively ignore the returned classifier from the runs of Algorithm 2 for 
which convergence is slow, thus automatically selecting an index for which 
the error rate is relatively small. 

As in the previous section, we can also show a variant of this result when 
the complexities are quantified in terms of the entropy. Specifically, consider 
the following condition and theorem; the proof is in Appendix B of the 
supplementary material [20]. Again, this represents an improvement over 
known results for passive learning when the disagreement coefficients are 
finite. 

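Condition 4. For each $i \in \mathbb{N}$, there exist finite constants $\alpha_i > 0$, $\rho_i \in
(0,1)$ s.t. $\forall m \in \mathbb{N}$ and $\epsilon \in [0,1]$, $\omega_{\mathbb{C}_i}(m,\epsilon) \le \alpha_i\cdot\max\{\epsilon^{(1-\rho_i)/2}m^{-1/2}, m^{-1/(1+\rho_i)}\}$.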

Theorem 8. Suppose $h_n$ is the classifier returned by Algorithm 3, when
allowed n label requests and confidence parameter $\delta \in (0,1/2)$. Suppose
further that $\mathcal{D}_{XY}$ satisfies Conditions 3 and 4. Then there exist finite ($\kappa_i$, $\mu_i$,
$\alpha_i$ and $\rho_i$-dependent) constants $c_i$ such that, with probability $\ge 1-\delta$, $\forall n \in \mathbb{N}$,



In addition to these theorems for this structure-dependent version of Tsy- 
bakov's noise conditions, we also have the following result for a structure- 
independent noise condition, in the sense that the noise condition does not 
depend on the particular choice of $\mathbb{C}_i$ sets, but only on the distribution
$\mathcal{D}_{XY}$ (and in some sense, the full class $\mathbb{C} = \bigcup_i\mathbb{C}_i$); it may be particularly
useful when the class C is universal, in the sense that it can approximate 
any classifier. 

Theorem 9. Suppose the sequence $\{\mathbb{C}_i\}$ is constructed so that $\nu_\infty = \nu^*$,
and $h_n$ is the classifier returned by Algorithm 3, when allowed n label requests
and confidence parameter $\delta \in (0,1/2)$. Suppose that there exists a constant
$\mu > 0$ s.t. for all measurable $h : \mathcal{X} \to \mathcal{Y}$, $\mathrm{er}(h) - \nu^* \ge \mu\mathbb{P}\{h(X) \ne h^*(X)\}$.
Then there exists a finite ($\mu$-dependent) constant c such that, with probability
$\ge 1-\delta$, $\forall n \in \mathbb{N}$,




erihn) -V* < cm.\n{vi - i/*) + ( + log - ) • exp 




ci 



i'^ei[di + \og{i/5)) 



n 
The condition $\nu_\infty = \nu^*$ is quite easy to satisfy: for example, $\mathbb{C}_i$ could be
axis-aligned decision trees of depth i, or thresholded polynomials of degree
i, or multi-layer neural networks with i internal units, etc. As for the noise
condition in Theorem 9, this would be satisfied whenever $\mathbb{P}(|\eta(X) - 1/2| \ge
c) = 1$ for some constant $c \in (0,1/2]$. The case where $\mathrm{er}(h) - \nu^* \ge \mu\mathbb{P}\{h(X) \ne
h^*(X)\}^{\kappa}$ for $\kappa > 1$ can be studied analogously, though the rate improvements
over passive learning are more subtle. 
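To see why the bounded-noise condition suffices, the following short derivation may help. It is a sketch under assumptions not restated in this section: that $\eta(x) = \mathbb{P}(Y = +1 \mid X = x)$, that $h^*$ is the Bayes classifier, and that $\nu^* = \mathrm{er}(h^*)$. Under those assumptions the condition of Theorem 9 holds with $\mu = 2c$.

```latex
\begin{align*}
\mathrm{er}(h) - \nu^*
  &= \mathbb{E}\bigl[\,|2\eta(X) - 1| \cdot \mathbf{1}[h(X) \ne h^*(X)]\,\bigr] \\
  &\ge 2c \cdot \mathbb{P}\{h(X) \ne h^*(X)\},
\end{align*}
% The inequality uses |2\eta(X) - 1| = 2|\eta(X) - 1/2| \ge 2c almost surely.
```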

6. Conclusions. Under Tsybakov's noise conditions, active learning can 
offer improved asymptotic convergence rates compared to passive learning 
when the disagreement coefficient is finite. It is also possible to preserve 
these improved convergence rates when learning with a nested structure of 
hypothesis classes, using an algorithm that adapts to both the noise condi- 
tions and the complexity of the optimal classifier. 

APPENDIX: DEFINITION OF $\hat{E}$ AND RELATED QUANTITIES

We define the quantity $\hat{E}_C$ following Koltchinskii's analysis of excess risk in
terms of local Rademacher complexity [23] . The general idea is to construct 
a bound on the excess risk achieved by a given algorithm, such as empirical 
risk minimization, via an application of Talagrand's inequality. Such a bound 
should be based on a measure of the expressiveness of the set of functions C; 
however, to bound the excess risk achieved by a particular algorithm given a 
number of data points, we need only measure the expressiveness of the set of 
functions the algorithm is likely to select from. For reasonable algorithms, 
such as empirical risk minimization, this means the set of functions with 
reasonably small excess risk. Thus, we can bound the excess risk of the 
algorithm in terms of a measure of expressiveness of the set of functions with 
relatively small risk, typically referred to as a local complexity measure. This 
reasoning is somewhat circular, in that first we must decide how small to 
expect the excess risk of the returned function to be before we can calculate 
the local complexity measure, which itself is used to calculate a bound on the 
risk of the returned function. Thus, we define the bound on the excess risk as 
a kind of fixed point. Furthermore, we can estimate these quantities using 
data-dependent confidence bounds, so that the excess risk bound can be 
calculated without direct access to the distribution. For the data-dependent 
measure of the expressiveness of the function class, we can use a Rademacher 
process. A detailed motivation and derivation can be found in [23]. 

For our purposes, we add an additional constraint, by requiring the
functions we calculate the complexity of to agree with the labels of a labeled
set $\mathcal{L}$. This is helpful for us, since given a set $Q$ of labeled data with true
labels, for any two functions $h_1$ and $h_2$ that agree on the labels of $\mathcal{L}$, it is
always true that $\mathrm{er}_{\mathcal{L} \cup Q}(h_1) - \mathrm{er}_{\mathcal{L} \cup Q}(h_2)$ equals the difference of the true
empirical error rates. As we prove in the supplement, as long as the set $\mathcal{L}$
is chosen carefully (i.e., as in Algorithm 2), the addition of this constraint
is essentially inconsequential, so that $\hat{E}_C$ remains a valid excess risk bound.
The detailed definitions are stated as follows.

For any function $f : \mathcal{X} \to \mathbb{R}$, and $\xi_1, \xi_2, \ldots$ a sequence of independent
random variables with distribution uniform in $\{-1, +1\}$, define the Rademacher
process for $f$ under a finite set of (index, label) pairs $S \subset \mathbb{N} \times \mathcal{Y}$ as

$$\hat{R}(f; S) = \frac{1}{|S|} \sum_{(i,y) \in S} \xi_i f(X_i).$$

The $\xi_i$ should be thought of as internal variables in the learning algorithm,
rather than being fundamental to the learning problem.
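For any two finite sets $\mathcal{L} \subset \mathbb{N} \times \mathcal{Y}$ and $S \subset \mathbb{N} \times \mathcal{Y}$, define

$$C[\mathcal{L}] = \{h \in C : \mathrm{er}_{\mathcal{L}}(h) = 0\},$$

$$\hat{C}(\varepsilon; \mathcal{L}, S) = \Bigl\{h \in C[\mathcal{L}] : \mathrm{er}_S(h) - \min_{h' \in C[\mathcal{L}]} \mathrm{er}_S(h') \le \varepsilon\Bigr\},$$

$$\hat{D}_C(\varepsilon; \mathcal{L}, S) = \sup_{h_1, h_2 \in \hat{C}(\varepsilon; \mathcal{L}, S)} \frac{1}{|S|} \sum_{(i,y) \in S} \mathbf{1}[h_1(X_i) \ne h_2(X_i)]$$

and

$$\hat{\phi}_C(\varepsilon; \mathcal{L}, S) = \frac{1}{2} \sup_{h_1, h_2 \in \hat{C}(\varepsilon; \mathcal{L}, S)} \hat{R}(h_1 - h_2; S).$$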
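For a finite hypothesis class, the empirical quantities defined above can be evaluated by brute force. The following is a minimal sketch under that assumption, not the procedure used in the paper: the threshold classifiers, the dictionary `X` mapping indices to points, and all function names are hypothetical, and the factor 1/2 in the last function follows the definition as read from the text.

```python
import random
from itertools import product

def threshold(t):
    """Hypothetical classifier h_t(x) = +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1

def draw_signs(indices, seed=0):
    """Independent uniform {-1, +1} signs, one per index."""
    rng = random.Random(seed)
    return {i: rng.choice((-1, 1)) for i in indices}

def rademacher_process(f, S, X, xi):
    """(1/|S|) * sum over (i, y) in S of xi[i] * f(X[i])."""
    return sum(xi[i] * f(X[i]) for i, _ in S) / len(S)

def empirical_error(h, S, X):
    """Fraction of (index, label) pairs in S that h mislabels (0 if S is empty)."""
    return sum(h(X[i]) != y for i, y in S) / len(S) if S else 0.0

def consistent_subclass(C, L, X):
    """Classifiers in C that make no mistakes on the labeled set L."""
    return [h for h in C if empirical_error(h, L, X) == 0]

def eps_minimal_set(C, eps, L, S, X):
    """Members of the consistent subclass within eps of its best error on S.
    Assumes the consistent subclass is nonempty."""
    CL = consistent_subclass(C, L, X)
    errs = [empirical_error(h, S, X) for h in CL]
    best = min(errs)
    return [h for h, e in zip(CL, errs) if e - best <= eps]

def empirical_diameter(C, eps, L, S, X):
    """Largest empirical disagreement rate between two eps-minimal classifiers."""
    G = eps_minimal_set(C, eps, L, S, X)
    return max(sum(h1(X[i]) != h2(X[i]) for i, _ in S) / len(S)
               for h1, h2 in product(G, repeat=2))

def local_rademacher(C, eps, L, S, X, xi):
    """Half the supremum of the Rademacher process over differences h1 - h2."""
    G = eps_minimal_set(C, eps, L, S, X)
    return 0.5 * max(
        rademacher_process(lambda x, a=h1, b=h2: a(x) - b(x), S, X, xi)
        for h1, h2 in product(G, repeat=2))
```

The suprema are taken over ordered pairs, so the cost is quadratic in the size of the class; this is only meant to make the definitions concrete on toy examples.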

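For $\delta, \varepsilon > 0$, $m \in \mathbb{N}$, define $s_m(\delta) = \ln \frac{20 m^2 \log_2(3m)}{\delta}$, $\mathbb{Z}_\varepsilon = \{j \in \mathbb{Z} : 2^j \ge \varepsilon\}$,
and for any set $S \subset \mathbb{N} \times \mathcal{Y}$, define the set $S^{(m)} = \{(i,y) \in S : i \le m\}$. We use
the following definitions from Koltchinskii [23] with only minor
modifications.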



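Definition 3. For $\varepsilon \in [0, 1]$, and finite sets $S, \mathcal{L} \subset \mathbb{N} \times \mathcal{Y}$, define

$$\hat{U}_C(\varepsilon, \delta; \mathcal{L}, S) = \hat{K} \left( \hat{\phi}_C(\hat{c}\varepsilon; \mathcal{L}, S) + \sqrt{\frac{s_{|S|}(\delta) \hat{D}_C(\hat{c}\varepsilon; \mathcal{L}, S)}{|S|}} + \frac{s_{|S|}(\delta)}{|S|} \right)$$

and

$$\hat{E}_C(S, \delta; \mathcal{L}) = \inf\Bigl\{\varepsilon > 0 : \forall j \in \mathbb{Z}_\varepsilon,\ \min_{m \in \mathbb{N}} \hat{U}_C(2^j, \delta; \mathcal{L}^{(m)}, S^{(m)}) \le 2^{j-4}\Bigr\},$$

where, for our purposes, we can take $\hat{K} = 752$ and $\hat{c} = 3/2$, though there
seems to be room for improvement in these constants. For completeness, we
also define $\hat{E}_C(\emptyset, \delta; \mathcal{L}) = \infty$ by convention.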
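We will also define a related quantity, representing a distribution-dependent
version of $\hat{E}$, also explored by Koltchinskii [23]. Specifically, for $\varepsilon > 0$, define

$$C(\varepsilon) = \{h \in C : \mathrm{er}(h) - \nu \le \varepsilon\}.$$

For $m \in \mathbb{N}$, let

$$\tilde{\phi}_C(m, \varepsilon) = \mathbb{E} \sup_{h_1, h_2 \in C(\varepsilon)} \bigl|(\mathrm{er}(h_1) - \mathrm{er}_m(h_1)) - (\mathrm{er}(h_2) - \mathrm{er}_m(h_2))\bigr|,$$

$$\tilde{U}_C(m, \varepsilon, \delta) = \tilde{K} \left( \tilde{\phi}_C(m, \tilde{c}\varepsilon) + \sqrt{\frac{s_m(\delta)\, \mathrm{diam}(\tilde{c}\varepsilon; C)}{m}} + \frac{s_m(\delta)}{m} \right)$$

and

$$\tilde{E}_C(m, \delta) = \inf\bigl\{\varepsilon > 0 : \forall j \in \mathbb{Z}_\varepsilon,\ \tilde{U}_C(m, 2^j, \delta) \le 2^{j-4}\bigr\},$$

where, for our purposes, we can take $\tilde{K} = 8272$ and $\tilde{c} = 3$. For completeness,
we also define $\tilde{E}_C(0, \delta) = \infty$.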
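The fixed-point character of these definitions can be illustrated numerically. The sketch below is an assumption-laden illustration rather than part of the paper: `U` is a hypothetical callable standing in for the bound as a function of its excess-risk argument (with the sample size and confidence held fixed), the threshold $2^{j - \text{shift}}$ with `shift = 4` is taken from the definition as read here, and the search is truncated to a finite range of $j$, treating the condition as satisfied for every $j$ above `j_max`.

```python
def dyadic_fixed_point(U, j_min=-30, j_max=0, shift=4):
    """Approximate inf{eps > 0 : for all j with 2**j >= eps, U(2**j) <= 2**(j - shift)}.

    The condition is only checked for j_min <= j <= j_max; for j > j_max it is
    assumed to hold (a truncation made for this sketch).
    """
    j0 = j_max + 1  # smallest j such that the condition holds for all j' >= j
    for j in range(j_max, j_min - 1, -1):
        if U(2.0 ** j) <= 2.0 ** (j - shift):
            j0 = j
        else:
            break
    # Every eps in (2**(j0 - 1), 2**j0] is feasible, so the infimum is 2**(j0 - 1).
    return 2.0 ** (j0 - 1)
```

For instance, a constant bound `U = lambda eps: 0.01` satisfies the condition exactly for $j \ge -2$, so the function returns `0.125`. Because the condition must hold at every dyadic scale above the returned value, the scan proceeds from large scales downward and stops at the first failure.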

A.1. Definition of $r_0$. In Definition 1, we took $r_0 = 0$. If $\theta < \infty$, then this
choice is usually relatively harmless. However, in some cases, setting $r_0 = 0$
results in a suboptimal, or even infinite, value of $\theta$, which is undesirable. In
these cases, we would like to set $r_0$ as large as possible while maintaining
the validity of the bounds. If we do this carefully enough, we should be
able to establish bounds that, even in the worst case when $\theta = 1/r_0$, are
never worse than the bounds for some analogous passive learning method;
however, to do this requires $r_0$ to depend on the parameters of the learning
problem: namely, $n$, $\delta$, $C$ and $\mathcal{D}_{XY}$. The effect of a larger $r_0$ can sometimes
be dramatic, as there are scenarios where $1 \ll \theta \ll 1/r_0$ [8]; we certainly wish
to distinguish between such scenarios, and those where $\theta \propto 1/r_0$.

Generally, depending on the bound we wish to prove, different values of
$r_0$ may be appropriate. For the tightest bound in terms of $\theta$ proven in the
Appendices (namely, Lemma 7 of Appendix B in the supplementary material
[20]), the definition of $r_0 = r_C(n, \delta)$ in (13) below gives a good bound. For the
looser bounds (namely, Theorems 5 and 6), a larger value of $r_0$ may provide
better bounds; however, this same general technique can be employed to
define a good value for $r_0$ in these looser bounds as well, simply using upper
bounds on (13) analogous to how the theorems themselves are derived from
Lemma 7 in Appendix B [20]. Likewise, one can state analogous refinements
of $r_0$ for Theorems 1-4, though for brevity these are left for the reader's
independent consideration.

Definition 4. Define 

m G N : n < logg + 2e J]] P(DIS(C(6£c(^, -5)))) > 

1=0 J 

(12) 
and

{ mc(n,5)-l ^ 
^— — y diam(6£c(^,'^);C),2-'' I. 

We use this definition of $r_0 = r_C(n, \delta)$ in all of the main proofs. In
particular, with this definition, Lemma 7 of Appendix B [20] is never significantly
worse than the analogous known result for passive learning (though it can
be significantly better when $\theta \ll 1/r_0$).
SUPPLEMENTARY MATERIAL

Proofs and Supplements for "Rates of Convergence in Active Learning" 

(DOI: 10.1214/10-AOS843SUPP; .pdf). The supplementary material
contains three additional Appendices, namely, Appendices B, C and D.
Specifically, Appendix B provides detailed proofs of Theorems 5-9, as well as
several abstract lemmas from which these results are derived. Appendix C
discusses the use of estimators in Algorithm 1. Finally, Appendix D includes
a proof of a general minimax lower bound $\propto n^{-\kappa/(2\kappa-2)}$ for any nontrivial
hypothesis class, generalizing a result of Castro and Nowak [12].